Prevent overflow/underflow in float division - c

I have two numbers:
FL_64 variable_number;
FL_64 constant_number;
The constant number is always the same, for example:
constant_number=(FL_64)0.0000176019966602325;
The variable number is given to me and I need to perform the division:
FL_64 result = variable_number/constant_number;
What would be the checks I need to do to variable_number in order to make sure the operation will not overflow / underflow before performing it?
Edit: FL_64 is just a typedef for double so FL_64 = double.

A Test For Overflow
Assume:
The C implementation uses IEEE-754 arithmetic with round-to-nearest-ties-to-even.
The magnitude of the divisor is at most 1, and the divisor is non-zero.
The divisor is positive.
The test and the proof below are written with the above assumptions for simplicity, but the general cases are easily handled:
If the divisor might be negative, use fabs(divisor) in place of divisor when calculating the limit shown below.
If the divisor is zero, there is no need to test for overflow, as it is already known an error (divide-by-zero) occurs.
If the magnitude exceeds 1, the division never creates a new overflow. Overflow occurs only if the dividend is already infinity (so a test would be isinf(candidate)). (With a divisor exceeding 1 in magnitude, the division could underflow. This answer does not discuss testing for underflow in that case.)
Note about notation: Expressions using non-code-format operators, such as x•y, represent exact mathematical expressions, without floating-point rounding. Expressions in code format, such as x*y, mean the computed results with floating-point rounding.
To detect overflow when dividing by divisor, we can use:
FL_64 limit = DBL_MAX * divisor;
if (-limit <= candidate && candidate <= limit)
// Overflow will not occur.
else
// Overflow will occur or candidate or divisor is NaN.
Proof:
limit will equal DBL_MAX multiplied by divisor and rounded to the nearest representable value. This is exactly DBL_MAX•divisor•(1+e) for some error e such that −2^−53 ≤ e ≤ 2^−53, by the properties of rounding to nearest plus the fact that no representable value for divisor can, when multiplied by DBL_MAX, produce a value below the normal range. (In the subnormal range, the relative error due to rounding could be greater than 2^−53. Since the product remains in the normal range, that does not occur.)
However, e = 2^−53 can occur only if the exact mathematical value of DBL_MAX•divisor falls exactly midway between two representable values, thus requiring it to have 54 significant bits (the bit that is ½ of the lowest position of the 53-bit significand of representable values is the 54th bit, counting from the leading bit). We know the significand of DBL_MAX is 0x1fffffffffffff (53 bits). Multiplying it by odd numbers produces 0x1fffffffffffff (when multiplied by 1), 0x5ffffffffffffd (by 3), and 0x9ffffffffffffb (by 5), and numbers with more significant bits when multiplied by greater odd numbers. Note that 0x5ffffffffffffd has 55 significant bits. None of these has exactly 54 significant bits. When multiplied by even numbers, the product has trailing zeros, so the number of significant bits is the same as when multiplying by the odd number that results from dividing the even number by the greatest power of two that divides it. Therefore, no product of DBL_MAX is exactly midway between two representable values, so the error e is never exactly 2^−53. So −2^−53 < e < 2^−53.
So, limit = DBL_MAX•divisor•(1+e), where |e| < 2^−53. Therefore limit/divisor is DBL_MAX•(1+e). Since this result is less than ½ ULP from DBL_MAX, it never rounds up to infinity, so it never overflows. So dividing any candidate that is less than or equal to limit by divisor does not overflow.
Now we will consider candidates exceeding limit. As with the upper bound, e cannot equal −2^−53, for the same reason. Then the least e can be is −2^−53 + 2^−105, because the product of DBL_MAX and divisor has at most 106 significant bits, so any increase from the midpoint between two representable values must be by at least one part in 2^105. Then, if limit < candidate, candidate is at least one part in 2^52 greater than limit, since there are 53 bits in a significand. So DBL_MAX•divisor•(1−2^−53+2^−105)•(1+2^−52) < candidate. Then candidate/divisor is at least DBL_MAX•(1−2^−53+2^−105)•(1+2^−52), which is DBL_MAX•(1+2^−53+2^−157). This exceeds the midpoint between DBL_MAX and what would be the next representable value if the exponent range were unbounded, which is the basis for the IEEE-754 rounding criterion. Therefore, it rounds up to infinity, so overflow occurs.
Underflow
Dividing by a number with magnitude less than one of course makes a number larger in magnitude, so it never underflows to zero. However, the IEEE-754 definition of underflow is that a non-zero result is tiny (in the subnormal range), either before or after rounding (whether to use before or after is implementation-defined). It is of course possible that dividing a subnormal number by a divisor less than one will produce a result still in the subnormal range. However, for this to happen, underflow must have occurred previously, to get the subnormal dividend in the first place. Therefore, underflow will never be introduced by a division by a number with magnitude less than one.
If one does wish to test for this underflow, one might test similarly to the test for overflow—by comparing the candidate to the minimum normal (or the greatest subnormal) multiplied by divisor—but I have not yet worked through the numerical properties.

Assuming FL_64 is something like a double you can get the maximum value which is named DBL_MAX from float.h
So you want to make sure that
DBL_MAX >= variable_number/constant_number
or equally
DBL_MAX * constant_number >= variable_number
In code that could be something like
if (constant_number > 0.0 && constant_number < 1.0)
{
    if (DBL_MAX * constant_number >= variable_number)
    {
        // won't overflow
    }
    else
    {
        // will overflow
    }
}
else
{
    // add code for other ranges of constant_number
}
However, notice that floating point calculations are imprecise, so there may be corner cases where the above code will fail.

I'm going to attempt to answer the question you asked (instead of trying to answer a different "How to detect overflow or underflow that was not prevented" question that you didn't ask).
To prevent overflow and underflow for division during the design of software:
Determine the range of the numerator and find the values with the largest and smallest absolute magnitude
Determine the range of the divisor and find the values with the largest and smallest absolute magnitude
Make sure that the maximum representable value of the data type (e.g. FLT_MAX) multiplied by the smallest absolute magnitude of the range of divisors is larger than the largest absolute magnitude of the range of numerators (the overflow worst case is the largest numerator divided by the smallest divisor).
Make sure that the minimum normal value of the data type (e.g. FLT_MIN) multiplied by the largest absolute magnitude of the range of divisors is smaller than the smallest absolute magnitude of the range of numerators (the underflow worst case is the smallest numerator divided by the largest divisor).
Note that the last few steps may need to be repeated for each possible data type until you've found the "best" (smallest) data type that prevents overflow and underflow (e.g. you might check whether float satisfies the last 2 steps and find that it doesn't, then check whether double satisfies them and find that it does).
It's also possible that you find out that no data type is able to prevent overflow and underflow, and that you have to limit the range of values that could be used for numerator or divisor, or rearrange formulas (e.g. change a (c*a)/b into a (c/b)*a) or switch to a different representation ("double double", rational numbers, ...).
Also; be aware that this provides a guarantee that (for all combinations of values within your ranges) overflow and underflow will be prevented; but doesn't guarantee that the smallest data type will be chosen if there's some kind of relationship between the magnitudes of the numerators and divisors. For a simple example, if you're doing something like b = a*a+1; result = b/a; where the magnitude of the numerator depends on the magnitude of the divisor, then you'll never get the "largest numerator with smallest divisor" or "smallest numerator with largest divisor" cases and a smaller data type (that can't handle cases that won't exist) may be suitable.
Note that you can also do checks before each individual division. This tends to make performance worse (due to the branches/checks) while causing code duplication (e.g. providing alternative code that uses double for cases when float would've caused overflow or underflow); and can't work when the largest type supported isn't large enough (you end up with an } else { // Now what??? problem that can't be solved in a way that ensures values that should work do work because typically the only thing you can do is treat it as an error condition).

I don't know what standard your FL_64 adheres to, but if it's anything like IEEE 754, you'll want to watch out for
Not a Number
There might be a special NaN value. In some implementations, the result of comparing it to anything is false, so if (variable_number == variable_number) == 0, then that's what's going on. There may be macros and functions to check for this, depending on the implementation, such as isnan in the GNU C Library.
Infinity
IEEE 754 also supports infinity (and negative infinity). This can be the result of an overflow, for instance. If variable_number is infinite and you divide it by constant_number, the result will probably be infinite again. As with NaN, the implementation usually supplies macros or functions to test for this, otherwise you could try dividing the number by something and see if it got any smaller.
Overflow
Since dividing the number by constant_number will make it bigger, the variable_number could overflow if it is already enormous. Check if it's not so big that this can happen. But depending on what your task is, the possibility of it being this large might already be excluded. The 64 bit floats in IEEE 754 go up to about 10^308. If your number overflows, it might turn into infinity.

I personally don't know the FL_64 type; from the name I suppose it has a 64 bit representation, but is it signed or unsigned?
Anyway, I would see a potential problem only if the type is signed; otherwise both the quotient and remainder would be representable in the same quantity of bits.
In case of signed, you need to check the result sign:
FL_64 result = variable_number/constant_number;
if ((variable_number > 0 && constant_number > 0) || (variable_number < 0 && constant_number < 0)) {
    if (result < 0) {
        //OVER/UNDER FLOW
        printf("over/under flow");
    } else {
        //NO OVER/UNDER FLOW
        printf("no over/under flow");
    }
} else {
    if (result < 0) {
        //NO OVER/UNDER FLOW
        printf("no over/under flow");
    } else {
        //OVER/UNDER FLOW
        printf("over/under flow");
    }
}
Also other cases should be checked, like division by 0. But as you mentioned constant_number is always fixed and different from 0.
EDIT:
Ok, so there could be another way to check overflow, by using the DBL_MAX value. Having the maximum representable number for a double, you can multiply it by constant_number to compute the maximum allowed value of variable_number. In the code snippet below, the first case does not cause overflow, while the second does (since variable_number is larger than the computed maximum). So the second case is an overflow case.
#include <stdio.h>
#include <float.h>

typedef double FL_64;

int main() {
    FL_64 constant_number = (FL_64)0.0000176019966602325;
    FL_64 test = DBL_MAX * constant_number;
    FL_64 variable_number = test;
    FL_64 result;

    printf("MAX double value:\n%f\n\n", DBL_MAX);
    printf("Variable Number value:\n%f\n\n", variable_number);
    printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
    result = variable_number / constant_number;
    printf("Result: %f\n\n", result);

    variable_number *= 2;
    printf("Variable Number value:\n%f\n\n", variable_number);
    printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
    result = variable_number / constant_number;
    printf("Result:\n%f\n\n", result);

    return 0;
}
This is a specific-case solution, since you have a constant number. But this solution will not work in the general case.

Related

What happens if we keep dividing float 1.0 by 2 until it reaches zero?

float f = 1.0;
while (f != 0.0) f = f / 2.0;
This loop runs 150 times using 32-bit precision. Why is that so? Is it getting rounded to zero?
In common C implementations, the IEEE-754 binary32 format is used for float. It is also called “single precision.” It is a binary-based format where finite numbers are represented as ±f•2^e, where f is a 24-bit binary numeral in [1, 2) and e is an integer in [−126, 127].
In this format, 1 is represented as +1.00000000000000000000000₂•2^0. Dividing that by 2 yields ½, which is represented as +1.00000000000000000000000₂•2^−1. Dividing that by 2 yields +1.00000000000000000000000₂•2^−2, then +1.00000000000000000000000₂•2^−3, and so on until we reach +1.00000000000000000000000₂•2^−126.
When that is divided by two, the mathematical result is +1.00000000000000000000000₂•2^−127, but −127 is below the normal exponent range, [−126, 127]. Instead, the significand becomes denormalized; 2^−127 is represented by +0.10000000000000000000000₂•2^−126. Dividing that by 2 yields +0.01000000000000000000000₂•2^−126, then +0.00100000000000000000000₂•2^−126, +0.00010000000000000000000₂•2^−126, and so on until we get to +0.00000000000000000000001₂•2^−126.
At this point, we have done 149 divisions by 2; +0.00000000000000000000001₂•2^−126 is 2^−149.
When the next division is performed, the result would be 2^−150, but that is not representable in this format. Even with the lowest non-zero significand, 0.00000000000000000000001₂, and the lowest exponent, −126, we cannot get to 2^−150. The next lower representable number is +0.00000000000000000000000₂•2^−126, which equals 0.
So, the real-number-arithmetic result of the division would be 2^−150, but we cannot represent that in this format. The two nearest representable numbers are +0.00000000000000000000001₂•2^−126 just above it and +0.00000000000000000000000₂•2^−126 just below it. They are equally near 2^−150. The default rounding method is to take the nearest representable number and, in case of ties, to take the number with the even low digit. So +0.00000000000000000000000₂•2^−126 wins the tie, and that is produced as the result for the 150th division.
What happens is simply that your system has only a limited number of bits available for a variable, and hence limited precision; even though, mathematically, you can halve a number (!= 0) indefinitely without ever reaching zero, in a computer implementation that has a limited precision for a float variable, that variable will inevitably, at some stage, become indistinguishable from zero. The more bits your system uses, the more precision it has and the later this will happen, but at some stage it will.
Since I suppose this is meant to be C, I just implemented it in C (with a counter counting each iteration), and indeed it ran for 150 rounds until the loop ended. I also implemented it with a double, where it ran for 1075 iterations. Keep in mind, however, that the C standard does not define the exact precision of a float variable. In most implementations it's 32 bits for a float and 64 for a double. With a long double, I get 16,446 iterations.

C thinking : float vs. integers and float representation

When using integers in C (and in many other languages), one must pay attention when dividing about precision. It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
But what about floats? Does that still hold? Or are they represented in such a way that it is better to divide number of similar orders of magnitude rather than large ones by small ones?
The representation of floats/doubles, and similar floating-point working, is geared towards retaining a number of significant digits (aka "precision"), rather than a fixed number of decimal places, such as happens in fixed-point or integer working.
It is best to avoid combining quantities that may give rise to implicit under- or overflow in terms of the exponent, i.e. at the limits of the floating-point number range.
Hence, addition/subtraction of quantities of widely differing magnitudes (either explicitly, or due to having opposite signs) should be avoided and re-arranged, where possible, to avoid this well-known route to lost precision.
Example: it's better to refactor/re-order
small + big + small + big + small
as
(small+small+small) + big + big
since the smalls individually might make no difference to a big, and hence their contribution might disappear.
If there is any "noise" or imprecision in the lower bits of any quantity, it's also wise to be aware how loss of significant bits propagates through a computation.
With integers:
As long as there is no overflow, +,-,* is always exact.
With division, the result is truncated and often not equal to the mathematical answer.
With integers ia, ib, ic, multiplying before dividing (ia*ib/ic rather than ia*(ib/ic)) is better, as the quotient is based on more bits of the product ia*ib than of ib alone.
With floating point:
Issues are subtle. Again, as long as there is no over/underflow, the order of a *,/ sequence makes less impact than with integers. FP *,/ is akin to adding/subtracting logs. Typical results are within 0.5 ULP of the mathematically correct answer.
With FP and +,-, the result for values fa, fb, fc can differ significantly from the mathematically correct one when 1) the values are far apart in magnitude or 2) nearly equal values are subtracted, so that the error from a prior calculation becomes significant.
Consider the quadratic equation:
double d = sqrt(b*b - 4*a*c); // assume b*b - 4*a*c >= 0
double root1 = (-b + d)/(2*a);
double root2 = (-b - d)/(2*a);
Versus
double d = sqrt(b*b - 4*a*c); // assume b*b - 4*a*c >= 0
double root1 = (b < 0) ? (-b + d)/(2*a) : (-b - d)/(2*a);
double root2 = c/(a*root1); // assume a*root1 != 0
The 2nd has much better root2 precision when one root is near 0 and |b| is nearly d. This is because the b,d subtraction cancels many bits of significance, allowing the error in the calculation of d to become significant.
(for integer) It is always better to multiply and add things (thus creating a larger intermediary result, so long as it doesn't overflow) before dividing.
Does that still hold (for floats)?
In general the answer is No
It is easy to construct an example where adding all input before division will give you a huge rounding error.
Assume you want to add 10000000000 values and divide them by 1000. Further assume that each value is 1. So the expected result is 10000000.
Method 1
However, if you add all the values before division, you'll get the result 16777.216 (for a 32 bit float). As you can see it is pretty much off.
Method 2
So is it better to divide each value by 1000 before adding it to the result? If you do that, you'll get the result 32768.0 (for a 32 bit float). As you can see it is pretty much off as well.
Method 3
However, if you go on adding values until the temporary result is greater than 1000000, then divide the temporary result by 1000 and add that intermediate result to the final result, and repeat until you have added a total of 10000000000 values, you will get the correct result.
So there is no simple "always add before division" or "always divide before adding" when dealing with floating point. As a general rule it is typically a good idea to keep operands in similar magnitude. That is what the third example does.

Double value as a negative zero

I have a program example:
#include <stdio.h>

int main()
{
    double x;
    x = -0.000000;
    if (x < 0)
    {
        printf("x is less");
    }
    else
    {
        printf("x is greater");
    }
}
Why does the control go into the first branch and print x is less? What is -0.000000?
IEEE 754 defines a standard for floating point numbers, which is very commonly used. You can see its structure here:
Finite numbers, which may be either base 2 (binary) or base 10
(decimal). Each finite number is described by three integers: s = a
sign (zero or one), c = a significand (or 'coefficient'), q = an
exponent. The numerical value of a finite number is
(−1)^s × c × b^q
where b is the base (2 or 10). For example, if the sign is 1
(indicating negative), the significand is 12345, the exponent is −3,
and the base is 10, then the value of the number is −12.345.
So if the fraction is 0, and the sign is 0, you have +0.0.
And if the fraction is 0, and the sign is 1, you have -0.0.
The numbers have the same value, but they differ in the positive/negative check. This means, for instance, that if:
x = +0.0;
y = -0.0;
Then you should see:
(x - y) == 0
However, for x, the OP's code would go with "x is greater", while for y, it would go with "x is less".
Edit: Artur's answer and Jeffrey Sax's comment to this answer clarify that the difference in the test for x < 0 in the OP's question is actually a compiler optimization, and that actually the test for x < 0 for both positive and negative 0 should always be false.
Negative zero is still zero, so +0 == -0 and -0 < +0 is false. They are two representations of the same value. There are only a few operations for which it makes a difference:
1 / -0 = -infinity, while 1 / +0 = +infinity.
sqrt(-0) = -0, while sqrt(+0) = +0
Negative zero can be created in a few different ways:
Dividing a positive number by -infinity, or a negative number by +infinity.
An operation that produces an underflow on a negative number.
This may seem rather obscure, but there is a good reason for this, mainly to do with making mathematical expressions involving complex numbers consistent. For example, note that the identity 1/√(-z)==-1/√z is not correct unless you define the square root as I did above.
If you want to know more details, try and find William Kahan's Branch Cuts for Complex Elementary Functions, or Much Ado About Nothing's Sign Bit in The State of the Art in Numerical Analysis (1987).
Nathan is right, but there is one issue. Usually most float/double operations are performed by the coprocessor. However, some compilers try to be clever and, instead of letting the coprocessor do the comparison (it treats -0.0 and +0.0 the same as 0.0), just assume that since your x variable has a minus sign it should be treated as negative, and optimize your code accordingly.
If you were able to see what the assembly output looks like - I bet you'd only see a call to:
printf("x is less");
So it is optimization stuff (bad optimization).
BTW - VC 2008 produces correct output here regardless of optimization level set.
For example - VC optimizes (at full/max optimization level) the code leaving this only:
printf("x is greater");
I like my compiler more every day ;-)

sqrt, perfect squares and floating point errors

In the sqrt function of most languages (though here I'm mostly interested in C and Haskell), are there any guarantees that the square root of a perfect square will be returned exactly? For example, if I do sqrt(81.0) == 9.0, is that safe or is there a chance that sqrt will return 8.999999998 or 9.00000003?
If numerical precision is not guaranteed, what would be the preferred way to check that a number is a perfect square? Take the square root, get the floor and the ceiling and make sure they square back to the original number?
Thank you!
In IEEE 754 floating-point, if the double-precision value x is the square of a nonnegative representable number y (i.e. y*y == x and the computation of y*y does not involve any rounding, overflow, or underflow), then sqrt(x) will return y.
This is all because sqrt is required to be correctly-rounded by the IEEE 754 standard. That is, sqrt(x), for any x, will be the closest double to the actual square root of x. That sqrt works for perfect squares is a simple corollary of this fact.
If you want to check whether a floating-point number is a perfect square, here's the simplest code I can think of:
int issquare(double d) {
    if (signbit(d)) return 0;
    feclearexcept(FE_INEXACT);
    double dd = sqrt(d);
    asm volatile("" : "+x"(dd));
    return !fetestexcept(FE_INEXACT);
}
I need the empty asm volatile block that depends on dd because otherwise your compiler might be clever and "optimise" away the calculation of dd.
I used a couple of weird functions from fenv.h, namely feclearexcept and fetestexcept. It's probably a good idea to look at their man pages.
Another strategy that you might be able to make work is to compute the square root, check whether it has set bits in the low 26 bits of the mantissa, and complain if it does. I try this approach below.
And I needed to check whether d is zero because otherwise it can return true for -0.0.
EDIT: Eric Postpischil suggested that hacking around with the mantissa might be better. Given that the above issquare doesn't work in another popular compiler, clang, I tend to agree. I think the following code works:
int _issquare2(double d) {
    if (signbit(d)) return 0;
    int foo;
    double s = sqrt(d);
    double a = frexp(s, &foo);
    frexp(d, &foo);
    if (foo & 1) {
        return (a + 33554432.0) - 33554432.0 == a && s*s == d;
    } else {
        return (a + 67108864.0) - 67108864.0 == a;
    }
}
Adding and subtracting 67108864.0 from a has the effect of wiping the low 26 bits of the mantissa. We will get a back exactly when those bits were clear in the first place.
According to this paper, which discusses proving the correctness of IEEE floating-point square root:
The IEEE-754 Standard for Binary Floating-Point Arithmetic [1] requires that the result of a divide or square root operation be calculated as if in infinite precision, and then rounded to one of the two nearest floating-point numbers of the specified precision that surround the infinitely precise result
Since a perfect square that can be represented exactly in floating-point is an integer and its square root is an integer that can be precisely represented, the square root of a perfect square should always be exactly correct.
Of course, there's no guarantee that your code will execute with a conforming IEEE floating-point library.
@tmyklebu perfectly answered the question. As a complement, let's see a possibly less efficient alternative for testing perfect squares of fractions, without the asm directive.
Let's suppose we have an IEEE 754 compliant sqrt which rounds the result correctly.
Let's suppose exceptional values (Inf/Nan) and zeros (+/-) are already handled.
Let's decompose sqrt(x) into I*2^m where I is an odd integer.
And where I spans n bits: 1+2^(n-1) <= I < 2^n.
If n > 1+floor(p/2) where p is floating point precision (e.g. p=53 and n>27 in double precision)
Then 2^(2n-2) < I^2 < 2^2n.
As I is odd, I^2 is odd too and thus spans more than p bits.
Thus I is not the exact square root of any representable floating point with this precision.
But given I^2<2^p, could we say that x was a perfect square?
The answer is obviously no. A Taylor expansion would give
sqrt(I^2+e) = I•(1 + e/(2I^2) − e^2/(8I^4) + O(e^3/I^6))
Thus, for e from ulp(I^2) up to about sqrt(ulp(I^2)), the square root is rounded to I (in round-to-nearest-even, truncate, or floor mode).
Thus we would have to assert that sqrt(x)*sqrt(x) == x.
But above test is not sufficient, for example, assuming IEEE 754 double precision, sqrt(1.0e200)*sqrt(1.0e200)=1.0e200, where 1.0e200 is exactly 99999999999999996973312221251036165947450327545502362648241750950346848435554075534196338404706251868027512415973882408182135734368278484639385041047239877871023591066789981811181813306167128854888448 whose first prime factor is 2^613, hardly a perfect square of any fraction...
So we can combine both tests:
#include <float.h>
bool is_perfect_square(double x) {
    return sqrt(x)*sqrt(x) == x
        && squared_significand_fits_in_precision(sqrt(x));
}
bool squared_significand_fits_in_precision(double x) {
    double scaled = scalb( x , DBL_MANT_DIG/2 - ilogb(x));
    return scaled == floor(scaled)
        && (scalb(scaled,-1) == floor(scalb(scaled,-1)) /* scaled is even */
            || scaled < scalb( sqrt((double) FLT_RADIX) , DBL_MANT_DIG/2 + 1));
}
EDIT:
If we want to restrict to the case of integers, we can also check that floor(sqrt(x))==sqrt(x) or use dirty bit hacks in squared_significand_fits_in_precision...
Instead of doing sqrt(81.0) == 9.0, try 9.0*9.0 == 81.0. This will always work as long as the square is within the limits of the floating point magnitude.
Edit: I was probably unclear about what I meant by "floating point magnitude". What I mean is to keep the number within the range of integer values that can be held without precision loss, less than 2**53 for a IEEE double. I also expected that there would be a separate operation to make sure the square root was an integer.
double root = floor(sqrt(x) + 0.5); /* rounded result to nearest integer */
if (root*root == x && x < 9007199254740992.0)
/* it's a perfect square */

Why does this floating-point loop terminate at 1,000,000?

The answer to this sample homework problem is "1,000,000", but I do not understand why:
What is the output of the following code?
int main(void) {
    float k = 1;
    while (k != k + 1) {
        k = k + 1;
    }
    printf("%g", k); // %g means output a floating point variable in decimal
}
If the program runs indefinitely but produces no output, write INFINITE LOOP as the answer to the question. All of the programs compile and run. They may or may not contain serious errors, however. You should assume that int is four bytes. You should assume that float has the equivalent of six decimal digits of precision. You may round your answer off to the nearest power of 10 (e.g., you can say 1,000 instead of 2^10 (i.e., 1024)).
I do not understand why the loop would ever terminate.
It doesn't run forever for the simple reason that floating point numbers are not perfect.
At some point, k will become big enough so that adding 1 to it will have no effect.
At that point, k will be equal to k+1 and your loop will exit.
Floating point numbers can be differentiated by a single unit only when they're in a certain range.
As an example, let's say you have an integer type with 3 decimal digits of precision for a positive integer and a single-decimal-digit exponent.
With this, you can represent the numbers 0 through 999 perfectly as 000×10^0 through 999×10^0 (since 10^0 is 1).
What happens when you want to represent 1000? You need to use 100×10^1. This is still represented perfectly.
However, there is no accurate way to represent 1001 with this scheme; the next number you can represent is 101×10^1, which is 1010.
So, when you add 1 to 1000, you'll get the closest match which is 1000.
The code is using a float variable.
As specified in the question, float has 6 digits of precision, meaning that any digits after the sixth will be inaccurate. Therefore, once you pass a million, the final digit will be inaccurate, so that incrementing it can have no effect.
The output of this program is not specified by the C standard, since the semantics of the float type are not specified. One likely result (what you will get on a platform for which float arithmetic is evaluated in IEEE-754 single precision) is 2^24.
All integers smaller than 2^24 are exactly representable in single precision, so the computation will not stop before that point. The next representable single precision number after 2^24, however, is 2^24 + 2. Since 2^24 + 1 is exactly halfway between that number and 2^24, in the default IEEE-754 rounding mode it rounds to the one whose trailing bit is zero, which is 2^24.
Other likely answers include 2^53 and 2^64. Still other answers are possible. Infinity (the floating-point value) could result on a platform for which the default rounding mode is round up, for example. As others have noted, an infinite loop is also possible on platforms that evaluate floating-point expressions in a wider type (which is the source of all sorts of programmer confusion, but allowed by the C standard).
Actually, on most C compilers, this will run forever (infinite loop), though the precise behavior is implementation defined.
The reason that most compilers will give an infinite loop is that they evaluate all floating point expressions at double precision and only round values back to float (single) precision when storing into a variable. So when the value of k gets to about 2^24, k == k + 1 will still evaluate as false (as a double can hold the value k+1 without rounding), but the k = k + 1 assignment will be a no-op, as k+1 needs to be rounded to fit into a float.
edit
gcc on x86 gets this infinite loop behavior. Interestingly on x64 it does not, as it uses sse instructions which do the comparison in float precision.
