Why does comparison of floating-point to infinity work? - c

As anyone knowledgeable enough will know, you cannot compare two floating-point numbers with simple logic operators and expect a logical result.
Why then, does a logical EQUAL TO comparison to INFINITY always return true when the number is, in fact, INFINITY?
PS: I tried the same comparison with NAN, but it seems only INFINITY can be reliably detected as being equal to another number of the same value.
(Edit) Here is my implementation's (GCC) definition of INFINITY and NAN:
# define INFINITY (__builtin_inff ())
# define NAN (__builtin_nanf (""))

As anyone knowledgeable enough will know, you cannot compare two floating-point numbers with simple logic operators and expect a logical result.
This has no basis in the IEEE 754 standard or any other specification of floating-point behavior I am aware of. It is an unfortunately common misstatement of floating-point arithmetic.
The fact is that comparison for equality is a perfect operation in floating-point: It produces true if and only if the two operands represent the same number. There is never any error in a comparison for equality.
Another misstatement is that floating-point numbers approximate real numbers. Per IEEE 754, each floating-point value other than NaN represents one number, and it represents that number exactly.
The fact is that floating-point numbers are exact while floating-point operations approximate real arithmetic; correctly-rounded operations produce the nearest representable value (nearest in any direction or in a selected direction, with various rules for ties).
This distinction is critical for understanding, analyzing, designing, and writing proofs about floating-point arithmetic.
Why then, does a logical EQUAL TO comparison to INFINITY always return true when the number is, in fact, INFINITY?
As stated above, comparison for equality produces true if and only if its operands represent the same number. If x is infinity, then x == INFINITY returns true. If x is three, then x == 3 returns true.
People sometimes run into trouble when they do not understand what value is in a number. For example, in float x = 3.3;, people sometimes do not realize that C converts the double 3.3 to float, and therefore x does not contain the same value as 3.3. This is because the conversion operation approximates its results, not because the value of x is anything other than its specific assigned value.
I tried the same comparison with NAN,…
A NaN is Not a Number, so, in a comparison for equality, it never satisfies “the two operands represent the same number”, so the comparison produces false.

You cannot compare two floating-point numbers with simple logic operators and expect a logical result.
There are traps surrounding comparing two floating-point numbers because they might not be exactly what you expect due to rounding error due lack of precision. Sometimes, a number is very close to but not exactly what we expect. And this is what causes problems.
This problem doesn't apply to +Inf. +Inf can't accidentally be stored as some number very close to +Inf. +Inf is a specific bit pattern. It's not something like INT_MAX; it's a value that's special to the processor itself.
Same goes for -Inf.

Related

IEEE Floating Point Numbers: Are they not "precise"? (read: well-defined)

For a long time I thought floating point arithmetic is well-defined and different platforms making the same calculation should get the same results. (Given the same rounding modes.)
Yet Microsoft's SQL Server deems any calculation performed with floating point to have an "imprecise" result, meaning you can't have an index on it. This suggests to me that the guys at Microsoft thought there's a relevant catch regarding floating point.
So what is the catch?
EDIT: You can have indexes on floats in general, just not on computed columns. Also, geodata uses floating point types for the coordinates and those are obviously indexed, too. So it's not a problem of reliable comparison in general.
Floating-point arithmetic is well defined by the IEEE 754 standard. In the documentation you point out, Microsoft has apparently not chosen to adhere to the standard.
There are a variety of issues that makes floating-point reproducibility difficult, and you can find Stack Overflow discussions about them by searching for “[floating-point] reproducibility”. However, most of these issues are about lack of control in high-level languages (the individual floating-point operations are completely reproducible and specified by IEEE 754, and the hardware provides sufficient IEEE 754 conformance, but the high-level language specification does not adequately map language constructs to specific floating-point operations), differences in math library routines (functions such as sin and log are “hard” to compute in some sense, and vendors implement them without what is called correct rounding, so each vendor’s routines have slightly different error characteristics than others), multithreading and other issues allow operations to be performed in different orders, thus yielding different results, and so on.
In a single system such as Microsoft’s SQL Server, Microsoft presumably could have controlled these issues if they wanted to. Still, there are issues to consider. For example, a database system may have a sum function that computes the sum of many things. For speed, you may wish the sum implementation to have the flexibility to add the elements in any order, so that it can take advantage of multiprocessing or of adding the elements in whatever order they happen to be stored in. But adding floating-point data using elementary add operations of the same floating-point format has different results depending on the order of the elements. To make the sum reproducible, you have to specify the order of operation or use extra precision or other techniques, and then performance suffers.
So, not making floating-point arithmetic is a choice that is made, not a consequence of any lack of specification for floating-point arithmetic.
Another problem for database purposes is that even well defined and completely specified floating-point arithmetic has NaN values. (NaN, an abbreviation for Not a Number, represents a floating-point datum that is not a number. A NaN is produced as the result of an operation that has no mathematical result (such as the real square root of a negative number). NaNs act as placeholders so that floating-point operations can continue without interruption, and an application can complete a set of floating-point operations and then take action to replace or otherwise deal with any NaNs that arose.) Because a NaN does not represent a number, it is not equal to anything, not even itself. Comparing two NaNs for equality produces false, even if the NaNs are represented with exactly the same bits. This is a problem for databases, because NaNs cannot be used as a key for looking up records because a NaN search key will never equal a NaN in the key field of a record. Sometimes this is deal with by defining two different ordering relations—one is the usual mathematical comparison, which defines less than, equal to, and greater than for numbers (and for which all three are false for NaNs), and a second which defines a sort order and which is defined for all data, including NaNs.
It should be noted that each floating-point datum that is not a NaN represents a certain number exactly. There is no imprecision in a floating-point number. A floating-point number does not represent an interval. Floating-point operations approximate real arithmetic in that they return values approximately equal to the exact mathematical results, while floating-point numbers are exact. Elementary floating-point operations are exactly specified by IEEE 754. Lack of reproducibility arises in using different operations (including the same operations with different precisions), in using operations in different orders, or in using operations that do not conform to the IEEE 754 standard.

Why adding 1.0/3.0 three times works just as mathematically expected?

I am aware that real numbers cannot be represented exactly in binary (even though with so-called double precision) in most cases. For example, 1.0/3.0 is approximated by 0x3fd5555555555555, which actually represents 0.33333333333333331483.... If we perform (1.0/3.0)+(1.0/3.0) then we obtain 0x3fe5555555555555 (so 0.66666666666666662965...), just as expected in a sense of computer arithmetic.
However, when I tried to perform (1.0/3.0)+(1.0/3.0)+(1.0/3.0) by writing the following code
#include<stdio.h>
int main(){
double result=1.0/3.0;
result+=1.0/3.0;
result+=1.0/3.0;
printf("%016llx\n",result);
}
and compiling it with the standard GNU C compiler, then the resulting program returned 0x3ff0000000000000 (which represents exactly 1). This result made me confused, because I initially expected 0x3fefffffffffffff (I did not expect rounding error to cancel each other because both (1.0/3.0) and ((1.0/3.0)+(1.0/3.0)) are smaller than actual value when represented in binary), and I still have not figured out what happened.
I would be grateful if you let me know possible reasons for this result.
There is no need to consider 80 bit representation - the results are the same in Java which requires, except for some irrelevant edge cases, the same behavior as IEEE 754 64-bit binary arithmetic for its doubles.
The exact value of 1.0/3.0 is 0.333333333333333314829616256247390992939472198486328125
As long as all numbers involved are in the normal range, multiplying or dividing by a power of two is exact. It only changes the exponent, not the significand. In particular, adding 1.0/3.0 to itself is exact, so the result of the first addition is 0.66666666666666662965923251249478198587894439697265625
The second addition does involve rounding. The exact sum is 0.99999999999999988897769753748434595763683319091796875, which is bracketed by representable numbers 0.999999999999999944488848768742172978818416595458984375 and 1.0. The exact value is half way between the bracketing numbers. A single bit has to be dropped. The least significant bit of 1.0 is a zero, so that is the rounded result of the addition.
That is a good rounding question. If I correctly remember, the arithmetic coprocessor uses 80 bits: 64 precision bits and 15 for the exponent (ref.). That means that internally the operation uses more bits than you can display. And in the end the coprocessor actually rounds its internal representation (more accurate) to give a 64 bit only value. And as the first bit dropped is 1 and not 0, the result is rounded upside giving 1.
But I must admit I am just guessing here...
But if you try to do by hand the operation, if immediately comes that the addition sets all precision bits to 1 (adding 5555...5 and 555...5 shifted by 1) plus the first bit to drop which is also 1. So by hand a normal human being would round upside also giving 1, so it is no surprise that the arithmetic unit is also able to do the correct rounding.

How unreliable are floating point values, operators and functions?

I don't want to introduce floating point when an inexact value would be a distaster, so I have a couple of questions about when you actually can use them safely.
Are they exact for integers as long as you don't overflow the number of significant digit? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and output of a compound expression).

Why dividing a float by a power of 10 is less accurate than typing the number directly?

When I run
printf("%.8f\n", 971090899.9008999);
printf("%.8f\n", 9710908999008999.0 / 10000000.0);
I get
971090899.90089989
971090899.90089977
I know why neither is exact, but what I don't understand is why doesn't the second match the first?
I thought basic arithmetic operations (+ - * /) were always as accurate as possible...
Isn't the first number a more accurate result of the division than the second?
Judging from the numbers you're using and based on the standard IEEE 754 floating point standard, it seems the left hand side of the division is too large to be completely encompassed in the mantissa (significand) of a 64-bit double.
You've got 52 bits worth of pure integer representation before you start bleeding precision. 9710908999008999 has ~54 bits in its representation, so it does not fit properly -- thus, the truncation and approximation begins and your end numbers get all finagled.
EDIT: As was pointed out, the first number that has no mathematical operations done on it doesn't fit either. But, since you're doing extra math on the second one, you're introducing extra rounding errors not present with the first number. So you'll have to take that into consideration too!
Evaluating the expression 971090899.9008999 involves one operation, a conversion from decimal to the floating-point format.
Evaluating the expression 9710908999008999.0 / 10000000.0 involves three operations:
Converting 9710908999008999.0 from decimal to the floating-point format.
Converting 10000000.0 from decimal to the floating-point format.
Dividing the results of the above operations.
The second of those should be exact in any good C implementation, because the result is exactly representable. However, the other two add rounding errors.
C does not require implementations to convert decimal to floating-point as accurately as possible; it allows some slack. However, a good implementation does convert accurately, using extra precision if necessary. Thus, the single operation on 971090899.9008999 produces a more accurate result than the multiple operations.
Additionally, as we learn from a comment, the C implementation used by the OP converts 9710908999008999.0 to 9710908999008998. This is incorrect by the rules of IEEE-754 for the common round-to-nearest mode. The correct result is 9710908999009000. Both of these candidates are representable in IEEE-754 64-bit binary, and both are equidistant from the source value, 9710908999008999. The usual rounding mode is round-to-nearest, ties-to-even, meaning the candidate with the even low bit should be selected, which is 9710908999009000 (with significand 0x1.1400298aa8174), not 9710908999008998 (with significand 0x1.1400298aa8173). (IEEE 754 defines another round-to-nearest mode: ties-to-away, which selects the candidate with the larger magnitude, which is again 9710908999009000.)
The C standard permits some slack in conversions; either of these two candidates conforms to the C standard, but good implementations also conform to IEEE 754.

Can bad stuff happen when dividing 1/a very small float?

If I want to check that positive float A is less than the inverse square of another positive float B (in C99), could something go wrong if B is very small?
I could imagine checking it like
if(A<1/(B*B))
But if B is small enough, would this possibly result in infinity? If that were to happen, would the code still work correctly in all situations?
In a similar vein, I might do
if(1/A>B*B)
... which might be slightly better because B*B might be zero if B is small (is this true?)
Finally, a solution that I can't imagine being wrong is
if(sqrt(1/A)>B)
which I don't think would ever result in zero division, but still might be problematic if A is close to zero.
So basically, my questions are:
Can 1/X ever be infinity if X is greater than zero (but small)?
Can X*X ever be zero if X is greater than zero?
Will comparisons with infinity work the way I would expect them to?
EDIT: for those of you who are wondering, I ended up doing
if(B*A*B<1)
I did it in that order as it is visually unambiguous which multiplication occurs first.
If you want to handle the entire range of possible values of A and B, then you need to be a little bit careful, but this really isn't too complicated.
The suggestion of using a*b*b < 1. is a good one; if b is so tiny that a*b*b underflows to zero, then a is necessarily smaller than 1./(b*b). Conversely, if b is so large that a*b*b overflows to infinity, then the condition will (correctly) not be satisfied. (Potatoswatter correctly points out in a comment on another post that this does not work properly if you write it b*b*a, because b*b might overflow to infinity even when the condition should be true, if a happens to be denormal. However, in C, multiplication associates left-to-right, so that is not an issue if you write it a*b*b and your platform adheres to a reasonable numerics model.)
Because you know a priori that a and b are both positive numbers, there is no way for a*b*b to generate a NaN, so you needn't worry about that condition. Overflow and underflow are the only possible misbehaviors, and we have accounted for them already. If you needed to support the case where a or b might be zero or infinity, then you would need to be somewhat more careful.
To answer your direct questions: (answers assume IEEE-754 arithmetic)
Can 1/X ever be infinity if X is greater than zero (but small)?
Yes! If x is a small positive denormal value, then 1/x can overflow and produce infinity. For example, in double precision in the default rounding mode, 1 / 0x1.0p-1024 will overflow.
Can X*X ever be zero if X is greater than zero?
Yes! In double precision in the default rounding mode, all values of x smaller than 0x1.0p-538 (thats 2**-578 in the C99 hex format) or so have this property.
Will comparisons with infinity work the way I would expect them to?
Yes! This is one of the best features of IEEE-754.
OK, reposting as an answer.
Try using arithmetically equivalent comparison like if ( A*B*B < 1. ). You might get in trouble with really big numbers though.
Take a careful look at the IEEE 754 for your corner cases.
You want to avoid divisions so the trick is to modify the equation. You can multiply both sides of your first equation by (b*b) to get:
b*b*a < 1.0
This won't have any divisions so should be ok.
Division per se isn't so bad. However, standard IEEE 754 FP types allow for a greater negative negative range of exponents than positive, due to denormalized numbers. For example, float ranges from 1.4×10-45 to 3.4×10-38, so you cannot take the inverse of 2×10-44.
Therefore, as Jeremy suggests, start by multiplying A by B, where one has a positive exponent and the other has a negative exponent, to avoid overflow.
This is why A*B*B<1 is the proper answer.

Resources