Currently I'm learning about floating point exceptions. I'm writing a loop with a function. In that function a value is calculated that equals 0.5. As the loop proceeds, the input value gets divided by 10.
The loop:
for(i = 0; i < e; i++)
{
xf /= 10.0; // force increasingly smaller values for x
float_testk (xf, i);
}
The function:
void float_testk(float x, int i)
{
float result;
feclearexcept(FE_ALL_EXCEPT); // clear all pending exceptions
result = (1 - cosf(x)) / (x * x);
if(fetestexcept(FE_UNDERFLOW) != 0)
fprintf(stderr,"Underflow occurred in double_testk!\n");
if(fetestexcept(FE_OVERFLOW) != 0)
fprintf(stderr,"Overflow occurred in double_testk!\n");
if(fetestexcept(FE_INVALID) != 0)
fprintf(stderr,"Invalid exception occurred in double_testk!\n");
printf("Iteration %3d, float result for x=%.8f : %f\n",i,x,result);
}
For the first few iterations the output is around 0.5, and later it becomes 0 due to catastrophic cancellation (CC). After a while this is the output of the program:
Iteration 18, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Iteration 19, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Iteration 20, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Iteration 21, float result for x=0.00000000 : 0.000000
Underflow occurred in double_testk!
Invalid exception occurred in double_testk!
Iteration 22, float result for x=0.00000000 : -nan
Underflow occurred in double_testk!
Invalid exception occurred in double_testk!
I want to know what happens at the transition from underflow to NaN. Because underflow means that the number is too small to be stored in the memory.
But if the number is already too small, what is the goal of NaN?
Because underflow means that the number is too small to be stored in the memory.
Not quite; in floating-point, underflow means a result is below the range at which numbers can be represented with full precision. The result may still be fairly accurate.
As long as x is greater than 2^−75, x * x produces a non-zero result. It may be in the subnormal part of the floating-point domain, where the precision is declining, but the real-number result of x•x is big enough to round to 2^−149 or greater. Then, for these small x, (1 - cosf(x)) / (x * x) evaluates to zero divided by a non-zero value, so the result is zero.
When x is 2^−75 or less, x * x produces zero, because the real-number result of x•x is so small that, in floating-point arithmetic, it is rounded to zero. Then (1 - cosf(x)) / (x * x) evaluates to zero divided by zero, so the result is a NaN. This is what happens in your iteration 22.
(2^−149 is the smallest positive value representable in IEEE-754 binary32, which your C implementation likely uses for float. Assuming the rounding mode is round-to-nearest, ties-to-even, real-number results between that and 2^−150 round up to 2^−149, and results of 2^−150 or lower round down to 0.)
NaN is a concept defined in the IEEE 754 standard for floating-point arithmetic. Not being a number is not the same as negative or positive infinity: NaN is used for arithmetic values that cannot be represented, not because they are too small or too large but simply because they don't exist. Examples:
1/0 = ∞ //too large
log (0) = -∞ //too small
sqrt (-1) = NaN //is not a number, can't be calculated
IEEE 754 floating point numbers can represent positive or negative infinity, and NaN (not a number). These three values arise from calculations whose result is undefined or cannot be represented accurately. You can also deliberately set a floating-point variable to any of them, which is sometimes useful. Some examples of calculations that produce infinity or NaN:
The goal of these flags you are using is to be compliant with the mentioned standard. It specifies five arithmetic exceptions that are to be recorded in the status flags:
FE_INEXACT: Inexact result, rounding was necessary to store the result of an earlier floating-point operation.
Set if the rounded (and returned) value is different from the mathematically exact result of the operation.
FE_UNDERFLOW: The result of an earlier floating-point operation was subnormal with a loss of precision.
Set if the rounded value is tiny (as specified in IEEE 754) and inexact (or maybe limited to if it has denormalization loss, as per the 1984 version of IEEE 754), returning a subnormal value including the zeros.
FE_OVERFLOW: The result of an earlier floating-point operation was too large to be representable.
Set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
FE_DIVBYZERO: Pole error occurred in an earlier floating-point operation.
Set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
FE_INVALID: Domain error occurred in an earlier floating-point operation.
Set if a real-valued result cannot be returned e.g. sqrt(−1) or 0/0, returning a quiet NaN.*
*Concept of quiet NaN:
Quiet NaNs, or qNaNs, do not raise any additional exceptions as they propagate through most operations. The exceptions are where the NaN cannot simply be passed through unchanged to the output, such as in format conversions or certain comparison operations.
Sources:
Floating point environment
Floating-point arithmetic
20.5.2 Infinity and NaN
NaN
float f = 1.0;
while (f != 0.0) f = f / 2.0;
This loop runs 150 times using 32-bit precision. Why is that so? Is it getting rounded to zero?
In common C implementations, the IEEE-754 binary32 format is used for float. It is also called “single precision.” It is a binary based format where finite numbers are represented as ±f•2^e, where f is a 24-bit binary numeral in [1, 2) and e is an integer in [−126, 127].
In this format, 1 is represented as +1.00000000000000000000000₂•2^0. Dividing that by 2 yields ½, which is represented as +1.00000000000000000000000₂•2^−1. Dividing that by 2 yields +1.00000000000000000000000₂•2^−2, then +1.00000000000000000000000₂•2^−3, and so on until we reach +1.00000000000000000000000₂•2^−126.
When that is divided by two, the mathematical result is +1.00000000000000000000000₂•2^−127, but −127 is below the normal exponent range, [−126, 127]. Instead, the significand becomes denormalized; 2^−127 is represented by +0.10000000000000000000000₂•2^−126. Dividing that by 2 yields +0.01000000000000000000000₂•2^−126, then +0.00100000000000000000000₂•2^−126, +0.00010000000000000000000₂•2^−126, and so on until we get to +0.00000000000000000000001₂•2^−126.
At this point, we have done 149 divisions by 2; +0.00000000000000000000001₂•2^−126 is 2^−149.
When the next division is performed, the result would be 2^−150, but that is not representable in this format. Even with the lowest non-zero significand, 0.00000000000000000000001₂, and the lowest exponent, −126, we cannot get to 2^−150. The next lower representable number is +0.00000000000000000000000₂•2^−126, which equals 0.
So, the real-number-arithmetic result of the division would be 2^−150, but we cannot represent that in this format. The two nearest representable numbers are +0.00000000000000000000001₂•2^−126 just above it and +0.00000000000000000000000₂•2^−126 just below it. They are equally near 2^−150. The default rounding method is to take the nearest representable number and, in case of ties, to take the number with the even low digit. So +0.00000000000000000000000₂•2^−126 wins the tie, and that is produced as the result for the 150th division.
What happens is simply that your system has only a limited number of bits available for a variable, and hence limited precision; even though, mathematically, you can halve a number (!= 0) indefinitely without ever reaching zero, in a computer implementation that has a limited precision for a float variable, that variable will inevitably, at some stage, become indistinguishable from zero. The more bits your system uses, the more precision it has and the later this will happen, but at some stage it will.
Since I suppose this is meant to be C, I just implemented it in C (with a counter counting each iteration), and indeed it ran for 150 rounds until the loop ended. I also implemented it with a double, where it ran for 1075 iterations. Keep in mind, however, that the C standard does not define the exact precision of a float variable. In most implementations it's 32 bits for a float and 64 for a double. With a long double, I get 16,446 iterations.
I have two numbers:
FL_64 variable_number;
FL_64 constant_number;
The constant number is always the same, for example:
constant_number=(FL_64)0.0000176019966602325;
The variable number is given to me and I need to perform the division:
FL_64 result = variable_number/constant_number;
What would be the checks I need to do to variable_number in order to make sure the operation will not overflow / underflow before performing it?
Edit: FL_64 is just a typedef for double so FL_64 = double.
A Test For Overflow
Assume:
The C implementation uses IEEE-754 arithmetic with round-to-nearest-ties-to-even.
The magnitude of the divisor is at most 1, and the divisor is non-zero.
The divisor is positive.
The test and the proof below are written with the above assumptions for simplicity, but the general cases are easily handled:
If the divisor might be negative, use fabs(divisor) in place of divisor when calculating the limit shown below.
If the divisor is zero, there is no need to test for overflow, as it is already known an error (divide-by-zero) occurs.
If the magnitude exceeds 1, the division never creates a new overflow. Overflow occurs only if the dividend is already infinity (so a test would be isinf(candidate)). (With a divisor exceeding 1 in magnitude, the division could underflow. This answer does not discuss testing for underflow in that case.)
Note about notation: Expressions using non-code-format operators, such as x•y, represent exact mathematical expressions, without floating-point rounding. Expressions in code format, such as x*y, mean the computed results with floating-point rounding.
To detect overflow when dividing by divisor, we can use:
FL_64 limit = DBL_MAX * divisor;
if (-limit <= candidate && candidate <= limit)
// Overflow will not occur.
else
// Overflow will occur or candidate or divisor is NaN.
Proof:
limit will equal DBL_MAX multiplied by divisor and rounded to the nearest representable value. This is exactly DBL_MAX•divisor•(1+e) for some error e such that −2^−53 ≤ e ≤ 2^−53, by the properties of rounding to nearest plus the fact that no representable value for divisor can, when multiplied by DBL_MAX, produce a value below the normal range. (In the subnormal range, the relative error due to rounding could be greater than 2^−53. Since the product remains in the normal range, that does not occur.)
However, e = 2^−53 can occur only if the exact mathematical value of DBL_MAX•divisor falls exactly midway between two representable values, thus requiring it to have 54 significant bits (the bit that is ½ of the lowest position of the 53-bit significand of representable values is the 54th bit, counting from the leading bit). We know the significand of DBL_MAX is 1fffffffffffff₁₆ (53 bits). Multiplying it by odd numbers produces 1fffffffffffff₁₆ (when multiplied by 1), 5ffffffffffffd₁₆ (by 3), and 9ffffffffffffb₁₆ (by 5), and numbers with more significant bits when multiplied by greater odd numbers. Note that 5ffffffffffffd₁₆ has 55 significant bits. None of these has exactly 54 significant bits. When multiplied by even numbers, the product has trailing zeros, so the number of significant bits is the same as when multiplying by the odd number that results from dividing the even number by the greatest power of two that divides it. Therefore, no product of DBL_MAX is exactly midway between two representable values, so the error e is never exactly 2^−53. So −2^−53 < e < 2^−53.
So, limit = DBL_MAX•divisor•(1+e), where e < 2^−53. Therefore limit/divisor is DBL_MAX•(1+e). Since this result is less than ½ ULP from DBL_MAX, it never rounds up to infinity, so it never overflows. So dividing any candidate that is less than or equal to limit by divisor does not overflow.
Now we will consider candidates exceeding limit. As with the upper bound, e cannot equal −2^−53, for the same reason. Then the least e can be is −2^−53 + 2^−105, because the product of DBL_MAX and divisor has at most 106 significant bits, so any increase from the midpoint between two representable values must be by at least one part in 2^105. Then, if limit < candidate, candidate is at least one part in 2^52 greater than limit, since there are 53 bits in a significand. So DBL_MAX•divisor•(1−2^−53+2^−105)•(1+2^−52) < candidate. Then candidate/divisor is at least DBL_MAX•(1−2^−53+2^−105)•(1+2^−52), which is DBL_MAX•(1+2^−53+2^−157). This exceeds the midpoint between DBL_MAX and what would be the next representable value if the exponent range were unbounded, which is the basis for the IEEE-754 rounding criterion. Therefore, it rounds up to infinity, so overflow occurs.
Underflow
Dividing by a number with magnitude less than one of course makes a number larger in magnitude, so it never underflows to zero. However, the IEEE-754 definition of underflow is that a non-zero result is tiny (in the subnormal range), either before or after rounding (whether to use before or after is implementation-defined). It is of course possible that dividing a subnormal number by a divisor less than one will produce a result still in the subnormal range. However, for this to happen, underflow must have occurred previously, to get the subnormal dividend in the first place. Therefore, underflow will never be introduced by a division by a number with magnitude less than one.
If one does wish to test for this underflow, one might proceed similarly to the test for overflow (by comparing the candidate to the minimum normal, or the greatest subnormal, multiplied by divisor), but I have not yet worked through the numerical properties.
Assuming FL_64 is something like a double you can get the maximum value which is named DBL_MAX from float.h
So you want to make sure that
DBL_MAX >= variable_number/constant_number
or equally
DBL_MAX * constant_number >= variable_number
In code that could be something like
if (constant_number > 0.0 && constant_number < 1.0)
{
if (DBL_MAX * constant_number >= variable_number)
{
// won't overflow
}
else
{
// will overflow
}
}
else
{
// add code for other ranges of constant_number
}
However, notice that floating point calculations are imprecise, so there may be corner cases where the above code will fail.
I'm going to attempt to answer the question you asked (instead of trying to answer a different "How to detect overflow or underflow that was not prevented" question that you didn't ask).
To prevent overflow and underflow for division during the design of software:
Determine the range of the numerator and find the values with the largest and smallest absolute magnitude
Determine the range of the divisor and find the values with the largest and smallest absolute magnitude
Make sure that the maximum representable value of the data type (e.g. FLT_MAX) multiplied by the smallest absolute magnitude of the range of divisors is larger than the largest absolute magnitude of the range of numerators.
Make sure that the minimum representable value of the data type (e.g. FLT_MIN) multiplied by the largest absolute magnitude of the range of divisors is smaller than the smallest absolute magnitude of the range of numerators.
Note that the last few steps may need to be repeated for each possible data type until you've found the "best" (smallest) data type that prevents overflow and underflow (e.g. you might check if float satisfies the last 2 steps and find that it doesn't, then check if double satisfies the last 2 steps and find that it does).
It's also possible that you find out that no data type is able to prevent overflow and underflow, and that you have to limit the range of values that could be used for numerator or divisor, or rearrange formulas (e.g. change a (c*a)/b into a (c/b)*a) or switch to a different representation ("double double", rational numbers, ...).
Also; be aware that this provides a guarantee that (for all combinations of values within your ranges) overflow and underflow will be prevented; but doesn't guarantee that the smallest data type will be chosen if there's some kind of relationship between the magnitudes of the numerators and divisors. For a simple example, if you're doing something like b = a*a+1; result = b/a; where the magnitude of the numerator depends on the magnitude of the divisor, then you'll never get the "largest numerator with smallest divisor" or "smallest numerator with largest divisor" cases and a smaller data type (that can't handle cases that won't exist) may be suitable.
Note that you can also do checks before each individual division. This tends to make performance worse (due to the branches/checks) while causing code duplication (e.g. providing alternative code that uses double for cases when float would've caused overflow or underflow); and can't work when the largest type supported isn't large enough (you end up with an } else { // Now what??? problem that can't be solved in a way that ensures values that should work do work because typically the only thing you can do is treat it as an error condition).
I don't know what standard your FL_64 adheres to, but if it's anything like IEEE 754, you'll want to watch out for
Not a Number
There might be a special NaN value. In IEEE 754 implementations, the result of comparing it for equality with anything (even itself) is 0, so if (variable_number == variable_number) == 0, then that's what's going on. There might be macros and functions to check for this depending on the implementation, such as in the GNU C Library.
Infinity
IEEE 754 also supports infinity (and negative infinity). This can be the result of an overflow, for instance. If variable_number is infinite and you divide it by constant_number, the result will probably be infinite again. As with NaN, the implementation usually supplies macros or functions to test for this, otherwise you could try dividing the number by something and see if it got any smaller.
Overflow
Since dividing the number by constant_number will make it bigger, the variable_number could overflow if it is already enormous. Check if it's not so big that this can happen. But depending on what your task is, the possibility of it being this large might already be excluded. The 64 bit floats in IEEE 754 go up to about 10^308. If your number overflows, it might turn into infinity.
I personally don't know the FL_64 variable type, from the name I suppose it has a 64 bit representation, but is it signed or unsigned?
Anyway, I would see a potential problem only if the type is signed; otherwise both the quotient and remainder would be representable on the same number of bits.
In case of signed, you need to check the result sign:
FL_64 result = variable_number/constant_number;
if ((variable_number > 0 && constant_number > 0) || (variable_number < 0 && constant_number < 0)) {
if (result < 0) {
//OVER/UNDER FLOW
printf("over/under flow");
} else {
//NO OVER/UNDER FLOW
printf("no over/under flow");
}
} else {
if (result < 0) {
//NO OVER/UNDER FLOW
printf("no over/under flow");
} else {
//OVER/UNDER FLOW
printf("over/under flow");
}
}
Also other cases should be checked, like division by 0. But as you mentioned constant_number is always fixed and different from 0.
EDIT:
OK, so there could be another way to check for overflow by using the DBL_MAX value. Since DBL_MAX is the maximum representable double, you can multiply it by constant_number to compute the maximum allowed value for variable_number. In the code snippet below, the first case does not cause overflow, while the second does (since variable_number is larger than test). The second variable_number is twice the first, so its quotient should be twice the first quotient; that value exceeds DBL_MAX, so the division overflows.
#include <stdio.h>
#include <float.h>
typedef double FL_64;
int main() {
FL_64 constant_number = (FL_64)0.0000176019966602325;
FL_64 test = DBL_MAX * constant_number;
FL_64 variable_number = test;
FL_64 result;
printf("MAX double value:\n%f\n\n", DBL_MAX);
printf("Variable Number value:\n%f\n\n", variable_number);
printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
result = variable_number / constant_number;
printf("Result: %f\n\n", result);
variable_number *= 2;
printf("Variable Number value:\n%f\n\n", variable_number);
printf(variable_number > test ? "Overflow case\n\n" : "No overflow\n\n");
result = variable_number / constant_number;
printf("Result:\n%f\n\n", result);
return 0;
}
This is a specific-case solution, since you have a constant value number. But this solution will not work in a general case.
In C, on an implementation with IEEE-754 floats, when I compare two floating point numbers which are NaN, it returns 0 or "false". But why do two floating point numbers which both are inf count as equal?
This Program prints "equal: ..." (at least under Linux AMD64 with gcc) and in my opinion it should print "different: ...".
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
volatile double a = 1e200; //use volatile to suppress compiler warnings
volatile double b = 3e200;
volatile double c = 1e200;
double resA = a * c; //resA and resB should be inf
double resB = b * c;
if (resA == resB)
{
printf("equal: %e * %e = %e = %e = %e * %e\n",a,c,resA,resB,b,c);
}
else
{
printf("different: %e * %e = %e != %e = %e * %e\n", a, c, resA, resB, b, c);
}
return EXIT_SUCCESS;
}
Another reason why I think inf is not the same as inf: the natural numbers and the rational numbers are both infinite sets, but they are not the same.
So why is inf == inf?
Infinities compare equal because that's what the standard says. From section 5.11 Details of comparison predicates:
Infinite operands of the same sign shall compare equal.
inf==inf for the same reason that almost all floating point numbers compare equal to themselves: Because they're equal. They contain the same sign, exponent, and mantissa.
You might be thinking of how NaN != NaN. But that's a relatively unimportant consequence of a much more important invariant: NaN != x for any x. As the name implies, NaN is not any number at all, and hence cannot compare equal to anything, because the comparison in question is a numeric one (hence why -0 == +0).
It would certainly make some amount of sense to have inf compare unequal to other infs, since in a mathematical context they're almost certainly unequal. But keep in mind that floating point equality is not the same thing as absolute mathematical equality; 0.1f * 10.0f != 1.0f, and 1e100f + 1.0f == 1e100f. Just as floating point numbers gradually underflow into denormals without compromising as-good-as-possible equality, so they overflow into infinity without compromising as-good-as-possible equality.
If you want inf != inf, you can emulate it: 1e400 == 3e400 evaluates to true, but 1e400 - 3e400 == 0 evaluates to false, because the result of +inf + -inf is NaN. (Arguably you could say it should evaluate to 0, but that would serve nobody's interest.)
Background
In C, according to the IEEE 754 binary floating point standard (so, if you use a float or a double) you're going to get an exact value that can be compared exactly with another variable of the same type. Well, this is true unless your computations result in a value that lies outside the range of integers that can be represented (i.e., overflow).
Why is Infinity == Infinity
resA and resB
The IEEE-754 standard tailored the values of infinity and negative infinity to be greater than or less than, respectively, all other values that may be represented according to the standard (<= INFINITY == 0 11111111111 0000000000000000000000000000000000000000000000000000 and >= -INFINITY == 1 11111111111 0000000000000000000000000000000000000000000000000000), except for NaN, which is neither less than, equal to, nor greater than any floating point value (even itself). Take note that infinity and its negative have explicit definitions in their sign, exponent, and mantissa bits.
So, resA and resB are infinity and since infinity is explicitly defined and reproducible, resA==resB. I'm fairly certain this is how isinf() is implemented.
Why is NaN != NaN
However, NaN is not explicitly defined. A NaN value has a sign bit of either value, exponent bits of all 1s (just like infinity and its negative), and any set of non-zero fraction bits (Source). So, how would you tell one NaN from another, if their fraction bits are arbitrary anyway? Well, the standard doesn't assume that and simply returns false when two floating point values of this structure are compared to one another.
More Explanation
Because infinity is an explicitly defined value (Source, GNU C Manual):
Infinities propagate through calculations as one would expect
2 + ∞ = ∞
4 ÷ ∞ = 0
arctan (∞) = π/2.
However, NaN may or may not propagate through computations. When it does, it is a QNaN (quiet NaN, most significant fraction bit set) and all computations will result in NaN. When it doesn't, it is an SNaN (signaling NaN, most significant fraction bit not set) and all computations will result in an error.
There are many arithmetic systems. Some of them, including the ones normally covered in high school mathematics, such as the real numbers, do not have infinity as a number. Others have a single infinity, for example the projectively extended real line. Others, such as the IEEE floating point arithmetic under discussion, and the extended real line, have both positive and negative infinity.
IEEE754 arithmetic is different from real number arithmetic in many ways, but is a useful approximation for many purposes.
There is logic to the different treatment of NaNs and infinities. It is entirely reasonable to say that positive infinity is greater than negative infinity and any finite number. It would not be reasonable to say anything similar about the square root of -1.
myInt = int( 5 * myRandom() )
myRandom() is a randomly generated float, which should be 0.2.
So this statement should evaluate to be 1.
My question: is it possible that due to a floating point error it will NOT evaluate to 1?
For example if due to a floating point error something which should be 0.2 could that be LESS than that?
IE, for instance consider the following 3 possibilities:
int(5 * 0.2 ) = 1 //case 1 normal
int(5 * 0.2000000000000001 ) = 1 //case 2 slightly larger, it's OK
int(5 * 0.1999999999999999 ) = 0 //case 3 negative epsilon, is NOT OK, as int() floors it
Is case 3 even possible, with 0.1999999999999999 being the result of a floating point error? I have never actually seen a negative epsilon so far, only case 2, where the value is slightly larger, and that's OK, because casting to int() 'floors' it to the correct result. However, with a negative epsilon the 'flooring' effect will make the resulting 0.9999999999999996 evaluate to 0.
It is impossible for myRandom to return .2 because .2 is not representable as a float or a double, assuming your target system is using the IEEE 754 binary floating-point standard, which is overwhelmingly the default.
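If myRandom() returns the representable number nearest .2, then myInt will be 1, because the number nearest .2 representable as a float is slightly greater than .2 (it is 0.20000000298023223876953125), and so is the nearest representable double (0.200000000000000011102230246251565404236316680908203125).
In other cases, the nearest representable number is below the intended value. E.g., the nearest double to .6 is 0.59999999999999997779553950749686919152736663818359375, so if the multiplication were exact, myInt would be 2, not 3. (With IEEE-754 double arithmetic in round-to-nearest mode, 5 times that double happens to round to exactly 3.0, but a product that lands just below an integer truncates to one less than expected.)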
Yes, it's possible, at least as far as the C standard is concerned.
The value 0.2 cannot be represented exactly in a binary floating-point format. The value returned by myRandom() will therefore be either slightly below, or slightly above, the mathematical value 0.2. The C standard permits either result.
Now it may well be that IEEE semantics only permit the result to be slightly greater than 0.2 -- but the C standard doesn't require IEEE semantics. And that's assuming that the result is derived as exactly as possible from the value 0.2. If the value is generated from a series of floating-point operations, each of which can introduce a small error, it could easily be either less than or greater than 0.2.
It's not a floating point error, it's the way floating point works. Any fraction whose denominator (in lowest terms) isn't a power of 2 can't be exactly represented, and will be rounded either up or down to the nearest representable number.
You can fix your code by multiplying by some small epsilon greater than one before converting to integer.
myInt = int( 5 * myRandom() * 1.000000000000001 )
See What Every Computer Scientist Should Know About Floating-Point Arithmetic.
It's possible, depending on the number you choose.
To check a specific number you can always print them with a lot of precision: printf("%1.50f", 0.2)
Why not multiply your float by 5.0 and then use the round function to properly round it?
I have the following C program:
#include <stdio.h>
int main()
{
double x=0;
double y=0/x;
if (y==1)
printf("y=1\n");
else
printf("y=%f\n",y);
if (y!=1)
printf("y!=1\n");
else
printf("y=%f\n",y);
return 0;
}
The output I get is
y=nan
y!=1
But when I change the line
double x=0;
to
int x=0;
the output becomes
Floating point exception
Can anyone explain why?
You're causing the division 0/0 with integer arithmetic (which is invalid, and produces the exception you see). Regardless of the type of y, what's evaluated first is 0/x.
When x is declared to be a double, the zero is converted to a double as well, and the operation is performed using floating-point arithmetic.
When x is declared to be an int, you are dividing one int 0 by another, and the result is not valid.
According to IEEE 754, NaN will be produced when conducting an illegal operation on floating point numbers (e.g. 0/0, ∞×0, or sqrt(−1)).
There are actually two kinds of NaNs, signaling and quiet. Using a
signaling NaN in any arithmetic operation (including numerical
comparisons) will cause an "invalid" exception. Using a quiet NaN
merely causes the result to be NaN too.
The representation of NaNs specified by the standard has some
unspecified bits that could be used to encode the type of error; but
there is no standard for that encoding. In theory, signaling NaNs
could be used by a runtime system to extend the floating-point numbers
with other special values, without slowing down the computations with
ordinary values. Such extensions do not seem to be common, though.
Also, Wikipedia says this about integer division by zero:
Integer division by zero is usually handled differently from floating
point since there is no integer representation for the result. Some
processors generate an exception when an attempt is made to divide an
integer by zero, although others will simply continue and generate an
incorrect result for the division. The result depends on how division
is implemented, and can either be zero, or sometimes the largest
possible integer.
There's a special bit-pattern in IEEE 754 which indicates NaN as the result of floating point division by zero errors.
However there's no such representation when using integer arithmetic, so the system has to throw an exception instead of returning NaN.
Check the min and max values of an integer data type. You will see that an undefined or NaN result is not in its range.
And read this what every computer scientist should know about floating point.
Integer division by 0 is illegal and is not handled. Float values on the other hand are handled in C using NaN. The following, however, would work.
int x=0;
double y = 0.0 / x;
If you divide an int by an int, you can't divide by 0.
0/0 in doubles is NaN.
int x=0;
double y=0/x; //0/0 as ints **after that** casted to double. You can use
double z=0.0/x; //or
double t=0/(double)x; // to avoid exception and get NaN
Floating point is inherently modeling the reals to limited precision. There are only a finite number of bit-patterns, but an infinite (continuous!) number of reals. It does its best of course, returning the closest representable real to the exact inputs it is given. Answers that are too small to be directly represented are instead represented by zero. Dividing by zero is an error in the real numbers. In floating point, however, because zero can arise from these very small answers, it can be useful to consider x/0.0 (for positive x) to be "positive infinity" or "too big to be represented". This is no longer useful for x = 0.0.
The best we could say is that dividing zero by zero is really "dividing something small that can't be told apart from zero by something small that can't be told apart from zero". What is the answer to this? Well, there is no answer for the exact case of 0/0, and there is no good way of treating it inexactly. It would depend on the relative magnitudes, and so the processor basically shrugs and says "I lost all precision -- any result I gave you would be misleading", by returning Not a Number.
In contrast, when doing an integer divide by zero, the divisor really can only mean precisely zero. There's no possible way to give a consistent meaning to it, so when your code asks for the answer, it really is doing something illegitimate.
(It's an integer division in the second case, but not the first because of the promotion rules of C. 0 can be taken as an integer literal, and as both sides are integers, the division is integer division. In the first case, the fact that x is a double causes the dividend to be promoted to double. If you replace the 0 by 0.0, it will be a floating-point division, no matter the type of x.)