Can bad stuff happen when dividing 1 by a very small float? (C)

If I want to check that positive float A is less than the inverse square of another positive float B (in C99), could something go wrong if B is very small?
I could imagine checking it like
if(A<1/(B*B))
But if B is small enough, would this possibly result in infinity? If that were to happen, would the code still work correctly in all situations?
In a similar vein, I might do
if(1/A>B*B)
... which might be slightly better because B*B might be zero if B is small (is this true?)
Finally, a solution that I can't imagine being wrong is
if(sqrt(1/A)>B)
which I don't think would ever result in zero division, but still might be problematic if A is close to zero.
So basically, my questions are:
Can 1/X ever be infinity if X is greater than zero (but small)?
Can X*X ever be zero if X is greater than zero?
Will comparisons with infinity work the way I would expect them to?
EDIT: for those of you who are wondering, I ended up doing
if(B*A*B<1)
I did it in that order as it is visually unambiguous which multiplication occurs first.

If you want to handle the entire range of possible values of A and B, then you need to be a little bit careful, but this really isn't too complicated.
The suggestion of using a*b*b < 1. is a good one; if b is so tiny that a*b*b underflows to zero, then a is necessarily smaller than 1./(b*b). Conversely, if b is so large that a*b*b overflows to infinity, then the condition will (correctly) not be satisfied. (Potatoswatter correctly points out in a comment on another post that this does not work properly if you write it b*b*a, because b*b might overflow to infinity even when the condition should be true, if a happens to be denormal. However, in C, multiplication associates left-to-right, so that is not an issue if you write it a*b*b and your platform adheres to a reasonable numerics model.)
Because you know a priori that a and b are both positive numbers, there is no way for a*b*b to generate a NaN, so you needn't worry about that condition. Overflow and underflow are the only possible misbehaviors, and we have accounted for them already. If you needed to support the case where a or b might be zero or infinity, then you would need to be somewhat more careful.
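As a concrete illustration, here is a minimal C99 sketch of that test (the function name less_than_inverse_square is mine, and it assumes strict IEEE-754 double evaluation, i.e. FLT_EVAL_METHOD == 0, plus positive, finite, non-NaN inputs):
#include <stdio.h>

/* Nonzero iff a < 1/(b*b), computed without any division.
   Assumes a and b are positive, finite, and not NaN. */
static int less_than_inverse_square(double a, double b)
{
    return a * b * b < 1.0;   /* left-to-right: (a*b)*b, as discussed above */
}

int main(void)
{
    /* b tiny: a*b underflows to zero, so the product is 0 and the test is (correctly) true */
    printf("%d\n", less_than_inverse_square(1e-300, 1e-100));
    /* b huge: the product overflows to +infinity and the test is (correctly) false */
    printf("%d\n", less_than_inverse_square(1e300, 1e300));
    return 0;
}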
To answer your direct questions: (answers assume IEEE-754 arithmetic)
Can 1/X ever be infinity if X is greater than zero (but small)?
Yes! If x is a small positive denormal value, then 1/x can overflow and produce infinity. For example, in double precision in the default rounding mode, 1 / 0x1.0p-1024 will overflow.
Can X*X ever be zero if X is greater than zero?
Yes! In double precision in the default rounding mode, all values of x smaller than about 0x1.0p-538 (that's roughly 2**-538 in the C99 hex-float notation) have this property.
Will comparisons with infinity work the way I would expect them to?
Yes! This is one of the best features of IEEE-754.
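For what it's worth, here is a small self-contained demonstration of all three answers (it assumes IEEE-754 doubles in the default round-to-nearest mode; the exact thresholds would differ for other formats):
#include <stdio.h>
#include <math.h>

int main(void)
{
    double tiny = 0x1.0p-1024;          /* a small positive denormal */
    double inv  = 1.0 / tiny;           /* overflows to +infinity */
    printf("1/tiny is inf: %d\n", isinf(inv));

    double x  = 0x1.0p-540;             /* well below about 2^-538 */
    double xx = x * x;                  /* underflows to zero when stored as a double */
    printf("x*x == 0: %d\n", xx == 0.0);

    /* Comparisons with infinity behave as expected */
    printf("inf > 1e308: %d\n", inv > 1e308);
    printf("1.0 < inf:   %d\n", 1.0 < inv);
    return 0;
}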

OK, reposting as an answer.
Try using an arithmetically equivalent comparison like if ( A*B*B < 1. ). You might get in trouble with really big numbers, though.
Take a careful look at the IEEE 754 for your corner cases.

You want to avoid divisions so the trick is to modify the equation. You can multiply both sides of your first equation by (b*b) to get:
b*b*a < 1.0
This won't have any divisions so should be ok.

Division per se isn't so bad. However, standard IEEE 754 FP types allow for a greater negative range of exponents than positive, due to denormalized numbers. For example, float ranges from about 1.4×10^-45 to 3.4×10^38, so you cannot take the inverse of 2×10^-44.
Therefore, as Jeremy suggests, start by multiplying A by B, where one has a positive exponent and the other has a negative exponent, to avoid overflow.
This is why A*B*B<1 is the proper answer.
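Here is a small sketch of that asymmetry in single precision (the concrete values assume the usual IEEE-754 binary32 float, with FLT_MAX around 3.4×10^38 and a smallest denormal around 1.4×10^-45):
#include <stdio.h>
#include <math.h>

int main(void)
{
    float b = 2e-44f;            /* representable only as a denormal float */
    float inv = 1.0f / b;        /* true quotient ~5e43 exceeds FLT_MAX (~3.4e38) */
    printf("1/b is inf: %d\n", isinf(inv));

    float bb = b * b;            /* underflows to zero when stored as a float */
    printf("b*b == 0: %d\n", bb == 0.0f);

    float a = 1e30f;
    float ab = a * b;            /* ~2e-14, comfortably in range */
    printf("a*b = %g\n", (double)ab);
    return 0;
}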

Related

Why does comparison of floating-point to infinity work?

As anyone knowledgeable enough will know, you cannot compare two floating-point numbers with simple logic operators and expect a logical result.
Why then, does a logical EQUAL TO comparison to INFINITY always return true when the number is, in fact, INFINITY?
PS: I tried the same comparison with NAN, but it seems only INFINITY can be reliably detected as being equal to another number of the same value.
(Edit) Here is my implementation's (GCC) definition of INFINITY and NAN:
# define INFINITY (__builtin_inff ())
# define NAN (__builtin_nanf (""))
As anyone knowledgeable enough will know, you cannot compare two floating-point numbers with simple logic operators and expect a logical result.
This has no basis in the IEEE 754 standard or any other specification of floating-point behavior I am aware of. It is an unfortunately common misstatement of floating-point arithmetic.
The fact is that comparison for equality is a perfect operation in floating-point: It produces true if and only if the two operands represent the same number. There is never any error in a comparison for equality.
Another misstatement is that floating-point numbers approximate real numbers. Per IEEE 754, each floating-point value other than NaN represents one number, and it represents that number exactly.
The fact is that floating-point numbers are exact while floating-point operations approximate real arithmetic; correctly-rounded operations produce the nearest representable value (nearest in any direction or in a selected direction, with various rules for ties).
This distinction is critical for understanding, analyzing, designing, and writing proofs about floating-point arithmetic.
Why then, does a logical EQUAL TO comparison to INFINITY always return true when the number is, in fact, INFINITY?
As stated above, comparison for equality produces true if and only if its operands represent the same number. If x is infinity, then x == INFINITY returns true. If x is three, then x == 3 returns true.
People sometimes run into trouble when they do not understand what value is in a number. For example, in float x = 3.3;, people sometimes do not realize that C converts the double 3.3 to float, and therefore x does not contain the same value as 3.3. This is because the conversion operation approximates its results, not because the value of x is anything other than its specific assigned value.
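A short example of that conversion effect (a sketch assuming ordinary IEEE-754 float and double):
#include <stdio.h>

int main(void)
{
    float x = 3.3;               /* the double 3.3 is rounded to float here */
    printf("x == 3.3  : %d\n", x == 3.3);   /* 0: x holds the float nearest 3.3, not the double */
    printf("x == 3.3f : %d\n", x == 3.3f);  /* 1: the same conversion happens on both sides */

    float y = 3.0;               /* 3 is exactly representable */
    printf("y == 3.0  : %d\n", y == 3.0);   /* 1: the equality itself is exact */
    return 0;
}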
I tried the same comparison with NAN,…
A NaN is Not a Number, so, in a comparison for equality, it never satisfies “the two operands represent the same number”, so the comparison produces false.
You cannot compare two floating-point numbers with simple logic operators and expect a logical result.
There are traps when comparing two floating-point numbers, because they might not be exactly what you expect due to rounding error and limited precision. Sometimes a number is very close to, but not exactly, what we expect, and that is what causes problems.
This problem doesn't apply to +Inf. +Inf can't accidentally be stored as some number very close to +Inf. +Inf is a specific bit pattern. It's not something like INT_MAX; it's a value that's special to the processor itself.
Same goes for -Inf.
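A minimal check of both behaviours, assuming <math.h>'s INFINITY and NAN macros as quoted in the question:
#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = INFINITY;
    printf("x == INFINITY  : %d\n", x == INFINITY);    /* 1: infinity compares equal to itself */
    printf("-x == -INFINITY: %d\n", -x == -INFINITY);  /* 1: same for -inf */

    double n = NAN;
    printf("n == NAN       : %d\n", n == NAN);         /* 0: NaN is never equal to anything */
    printf("n != n         : %d\n", n != n);           /* 1: the classic NaN test */
    printf("isnan(n)       : %d\n", isnan(n) != 0);    /* portable way to detect NaN */
    return 0;
}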

Why adding 1.0/3.0 three times works just as mathematically expected?

I am aware that real numbers cannot be represented exactly in binary (even with so-called double precision) in most cases. For example, 1.0/3.0 is approximated by 0x3fd5555555555555, which actually represents 0.33333333333333331483.... If we perform (1.0/3.0)+(1.0/3.0), we obtain 0x3fe5555555555555 (that is, 0.66666666666666662965...), just as expected in the sense of computer arithmetic.
However, when I tried to perform (1.0/3.0)+(1.0/3.0)+(1.0/3.0) by writing the following code
#include <stdio.h>
#include <string.h>
#include <stdint.h>
int main(void){
    double result = 1.0/3.0;
    result += 1.0/3.0;
    result += 1.0/3.0;
    uint64_t bits;
    memcpy(&bits, &result, sizeof bits); /* view the double's bit pattern */
    printf("%016llx\n", (unsigned long long)bits);
}
and compiling it with the standard GNU C compiler, the resulting program printed 0x3ff0000000000000 (which represents exactly 1). This result confused me, because I initially expected 0x3fefffffffffffff (I did not expect the rounding errors to cancel each other out, since both (1.0/3.0) and ((1.0/3.0)+(1.0/3.0)) are smaller than the true values when represented in binary), and I still have not figured out what happened.
I would be grateful if you let me know possible reasons for this result.
There is no need to consider an 80-bit representation: the results are the same in Java, which requires, except for some irrelevant edge cases, the same behavior as IEEE 754 64-bit binary arithmetic for its doubles.
The exact value of 1.0/3.0 is 0.333333333333333314829616256247390992939472198486328125
As long as all numbers involved are in the normal range, multiplying or dividing by a power of two is exact. It only changes the exponent, not the significand. In particular, adding 1.0/3.0 to itself is exact, so the result of the first addition is 0.66666666666666662965923251249478198587894439697265625
The second addition does involve rounding. The exact sum is 0.999999999999999944488848768742172978818416595458984375, which is bracketed by the representable numbers 0.99999999999999988897769753748434595763683319091796875 and 1.0. The exact value is exactly halfway between the bracketing numbers, so a single bit has to be dropped; the tie is broken by choosing the candidate whose least significant significand bit is zero (round half to even). That candidate is 1.0, so 1.0 is the rounded result of the addition.
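You can watch this happen with C99 hex-float output (a sketch; the exact %a formatting varies a little between C libraries, but the values are the ones described above):
#include <stdio.h>

int main(void)
{
    double third = 1.0 / 3.0;
    double two_thirds = third + third;      /* exact: just an exponent change */
    double sum = two_thirds + third;        /* the halfway case rounds to even */

    printf("1/3        = %a\n", third);       /* 0x1.5555555555555p-2 */
    printf("2/3        = %a\n", two_thirds);  /* 0x1.5555555555555p-1 */
    printf("2/3 + 1/3  = %a\n", sum);         /* 0x1p+0, i.e. exactly 1 */
    printf("sum == 1.0 : %d\n", sum == 1.0);
    return 0;
}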
That is a good rounding question. If I remember correctly, the arithmetic coprocessor uses 80 bits: 64 precision bits and 15 for the exponent (ref.). That means that internally the operation uses more bits than you can display, and in the end the coprocessor rounds its internal (more accurate) representation to a 64-bit value. Since the first bit dropped is 1 and not 0, the result is rounded upward, giving 1.
But I must admit I am just guessing here...
But if you work out the operation by hand, it immediately follows that the addition sets all precision bits to 1 (adding 5555...5 and 555...5 shifted by one place), plus a first dropped bit that is also 1. So by hand a normal human being would also round upward, giving 1, so it is no surprise that the arithmetic unit is able to do the correct rounding as well.

How unreliable are floating point values, operators and functions?

I don't want to introduce floating point when an inexact value would be a disaster, so I have a couple of questions about when you actually can use them safely.
Are they exact for integers as long as you don't overflow the number of significant digits? Are these two tests always true:
double d = 2.0;
if (d + 3.0 == 5.0) ...
if (d * 3.0 == 6.0) ...
What math functions can you rely on? Are these tests always true:
#include <math.h>
double d = 100.0;
if (log10(d) == 2.0) ...
if (pow(d, 2.0) == 10000.0) ...
if (sqrt(d) == 10.0) ...
How about this:
int v = ...;
if (log2((double) v) > 16.0) ... /* gonna need more than 16 bits to store v */
if (log((double) v) / log(2.0) > 16.0) ... /* C89 */
I guess you can summarize this question as: 1) Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h? 2) Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
I too find incorrect results distasteful.
On common hardware, you can rely on +, -, *, /, and sqrt working and delivering the correctly-rounded result. That is, they deliver the floating-point number closest to the sum, difference, product, quotient, or square root of their argument or arguments.
Some library functions, notably log2 and log10 and exp2 and exp10, traditionally have terrible implementations that are not even faithfully-rounded. Faithfully-rounded means that a function delivers one of the two floating-point numbers bracketing the exact result. Most modern pow implementations have similar issues. Lots of these functions will even blow exact cases like log10(10000) and pow(7, 2). Thus equality comparisons involving these functions, even in exact cases, are asking for trouble.
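If you want to know what your particular C library does, it is easy to probe such exact cases directly; this is just a probe, and the results legitimately differ from platform to platform, which is exactly the point:
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* All of these have exactly representable mathematical results, but a sloppy
       library implementation of log10/pow/exp2 can still miss them by an ulp or two. */
    printf("log10(10000.0) == 4.0 : %d\n", log10(10000.0) == 4.0);
    printf("pow(7.0, 2.0) == 49.0 : %d\n", pow(7.0, 2.0) == 49.0);
    printf("exp2(3.0) == 8.0      : %d\n", exp2(3.0) == 8.0);
    return 0;
}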
sin, cos, tan, atan, exp, and log have faithfully-rounded implementations on every platform I've recently encountered. In the bad old days, on processors using the x87 FPU to evaluate sin, cos, and tan, you would get horribly wrong outputs for largish inputs and you'd get the input back for larger inputs. CRlibm has correctly-rounded implementations; these are not mainstream because, I'm told, they've got rather nastier worst cases than the traditional faithfully-rounded implementations.
Things like copysign and nextafter and isfinite all work correctly. ceil and floor and rint and friends always deliver the exact result. fmod and friends do too. frexp and friends work. fmin and fmax work.
Someone thought it would be a brilliant idea to make fma(x,y,z) compute x*y+z by computing x*y rounded to a double, then adding z and rounding the result to a double. You can find this behaviour on modern platforms. It's stupid and I hate it.
I have no experience with the hyperbolic trig, gamma, or Bessel functions in my C library.
I should also mention that popular compilers targeting 32-bit x86 play by a different, broken, set of rules. Since the x87 is the only supported floating-point instruction set and all x87 arithmetic is done with an extended exponent, computations that would induce an underflow or overflow in double precision may fail to underflow or overflow. Furthermore, since the x87 also by default uses an extended significand, you may not get the results you're looking for. Worse still, compilers will sometimes spill intermediate results to variables of lower precision, so you can't even rely on your calculations with doubles being done in extended precision. (Java has a trick for doing 64-bit math with 80-bit registers, but it is quite expensive.)
I would recommend sticking to arithmetic on long doubles if you're targeting 32-bit x86. Compilers are supposed to set FLT_EVAL_METHOD to an appropriate value, but I do not know if this is done universally.
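You can ask the compiler what it claims about intermediate precision; FLT_EVAL_METHOD is standard C99, though its value (including -1 for "indeterminate") is implementation-defined:
#include <stdio.h>
#include <float.h>

int main(void)
{
    /* 0: float and double are evaluated at their own precision
       1: float expressions are evaluated as double
       2: everything is evaluated as long double (typical of x87 code)
       other values, including -1, are permitted by the standard */
    printf("FLT_EVAL_METHOD     = %d\n", (int)FLT_EVAL_METHOD);
    printf("sizeof(long double) = %zu bytes\n", sizeof(long double));
    return 0;
}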
Can floating point types hold the exact value of all integers up to the number of their significant digits in float.h?
Well, they can store the integers which fit in their mantissa (significand). So [-2^53, 2^53] for double. For more on this, see: Which is the first integer that an IEEE 754 float is incapable of representing exactly?
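A quick sketch of where exact integer representation stops for double; 2^53 is the first point at which consecutive integers start to collide:
#include <stdio.h>

int main(void)
{
    double p53 = 9007199254740992.0;     /* 2^53 */
    double a = (p53 - 1.0) + 1.0;        /* 2^53 - 1 is representable, so this is exact */
    double b = p53 + 1.0;                /* 2^53 + 1 is NOT representable: rounds back to 2^53 */
    double c = p53 + 2.0;                /* 2^53 + 2 is representable again */
    printf("(2^53 - 1) + 1 == 2^53 : %d\n", a == p53);   /* 1 */
    printf("2^53 + 1 == 2^53       : %d\n", b == p53);   /* 1 */
    printf("2^53 + 2 > 2^53        : %d\n", c > p53);    /* 1 */
    return 0;
}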
Do all floating point operators and functions guarantee that the result is the closest to the actual mathematical result?
They at least guarantee that the result is immediately on either side of the actual mathematical result. That is, you won't get a result which has a valid floating point value between itself and the "actual" result. But beware, because repeated operations may accumulate an error which seems counter to this, while it is not (because all intermediate values are subject to the same constraints, not just the inputs and output of a compound expression).
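A classic illustration of that accumulation effect: every individual addition below is correctly rounded, yet the total drifts away from the mathematically exact answer:
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < 10; i++)
        sum += 0.1;                   /* 0.1 is not exactly representable in binary */

    printf("sum        = %.17g\n", sum);     /* something like 0.99999999999999989 */
    printf("sum == 1.0 : %d\n", sum == 1.0); /* 0 on IEEE-754 systems */
    return 0;
}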

Floating Point Square Root Reciprocal Method Correct Rounding

I have implemented a 32-bit IEEE-754 Floating Point Square Root using the Newton-Raphson method (in assembly) based upon finding the reciprocal of the square root.
I am using the round-to-nearest rounding method.
My square root method only accepts normalized values and zeros, but no denormalized values or special values (NaN, Inf, etc.)
I am wondering how I can ACHIEVE correct rounding (with assembly like instructions) so that my results are correct (to IEEE-754) for all inputs?
Basically, I know how to test if my results are correct, but I want to adjust the algorithm below so that I obtain correctly rounded results. What instructions should I add to the algorithm?
See: Determining Floating Point Square Root
for more information
Thank you!
There are only about 2 billion floats matching your description. Try them all, compare against sqrtf from your C library, and examine all differences. You can get a higher-precision square root using sqrt or sqrtl from your C library if you are worried. sqrt, sqrtf, and sqrtl are correctly-rounded by typical C libraries, though, so a direct comparison ought to work.
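A brute-force harness along those lines might look like the sketch below; your_sqrt stands in for the routine under test (here it just wraps sqrtf so the program is self-contained), and the loop assumes binary32 floats whose bit patterns can be walked with a uint32_t:
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

/* Stand-in for the Newton-Raphson routine under test; replace with your own. */
static float your_sqrt(float x)
{
    return sqrtf(x);
}

int main(void)
{
    uint32_t mismatches = 0;
    /* Walk every positive normal float: 0x00800000 (smallest normal) up to,
       but not including, 0x7F800000 (+infinity). Add 0.0f separately if needed. */
    for (uint32_t bits = 0x00800000u; bits < 0x7F800000u; bits++) {
        float x;
        memcpy(&x, &bits, sizeof x);

        float got  = your_sqrt(x);
        float want = sqrtf(x);        /* correctly rounded in typical C libraries */
        if (memcmp(&got, &want, sizeof got) != 0) {
            if (mismatches < 10)
                printf("mismatch at x=%a: got %a, want %a\n", x, got, want);
            mismatches++;
        }
    }
    printf("%u mismatches\n", (unsigned)mismatches);
    return 0;
}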
Why not square the result, and if it's not equal to the input, add or subtract (depending on the sign of the difference) a least significant bit, square, and check whether that would have given a better result?
Better here could mean with less absolute difference. The only case where this could get tricky is when "crossing" √2 with the mantissa, but this could be checked once and for all.
EDIT
I realize that the above answer is insufficient. Simply squaring in 32-bit FP and comparing to the input doesn't give you enough information. Let's say y = your_sqrt(x). You compare y^2 to x, find that y^2 > x, subtract 1 LSB from y obtaining z (y1 in your comments), then compare z^2 to x and find that not only z^2 < x, but, within the available bits, y^2 - x == x - z^2 - how do you choose between y and z? You should either work with all the bits (I guess this is what you were looking for), or at least with more bits (which I guess is what njuffa is suggesting).
From a comment of yours I suspect you are on strictly 32-bit hardware, but let me suppose that you have a 32-bit by 32-bit integer multiplication with 64-bit result available (if not, it can be constructed). If you take the 23 bits of the mantissa of y as an integer, put a 1 in front, and multiply it by itself, you have a number that, except for a possible extra shift by 1, you can directly compare to the mantissa of x treated the same way. This way you have all 48 bits available for the comparison, and can decide without any approximation whether abs(y^2 - x) ≷ abs(z^2 - x).
If you are not sure to be within one LSB from the final result (but you are sure not to be much farther than that), you should repeat the above until y^2 - x changes sign or hits 0. Watch out for edge cases, though, which should essentially be the cases when the exponent is adjusted because the mantissa crosses a power of 2.
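If IEEE-754 double happens to be available (which the paragraph above deliberately avoids assuming), the same exact-comparison idea can be sketched compactly, because a 24-bit float significand squares exactly into a double's 53-bit significand. This is only an illustration of the technique, not the poster's integer/assembly algorithm; round_sqrt is my name for it:
#include <stdio.h>
#include <math.h>

/* Sketch: turn an estimate y, assumed to be within a few ulps of sqrt(x) for a
   positive normal float x, into the correctly rounded (round-to-nearest) result.
   Key fact: a float has a 24-bit significand, so products such as y*y and m*m
   below have at most 50 significant bits and are computed exactly in double. */
static float round_sqrt(float x, float y)
{
    /* Walk y until lo <= sqrt(x) < hi, where hi is the next float above lo. */
    while ((double)y * y > (double)x)             /* exact, so the sign is reliable */
        y = nextafterf(y, 0.0f);
    for (;;) {
        float up = nextafterf(y, INFINITY);
        if ((double)up * up <= (double)x)
            y = up;
        else
            break;
    }

    float lo = y, hi = nextafterf(y, INFINITY);
    if ((double)lo * lo == (double)x)             /* x is an exact square: done */
        return lo;

    /* The midpoint of two adjacent floats is exact in double, and so is its square.
       sqrt(x) < m  <=>  x < m*m, which settles the rounding with no error at all. */
    double m = ((double)lo + (double)hi) / 2.0;
    return ((double)x < m * m) ? lo : hi;
}

int main(void)
{
    float x = 2.0f;
    printf("refined: %a\nsqrtf  : %a\n", round_sqrt(x, 1.4142135f), sqrtf(x));
    return 0;
}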
It can also be helpful to remember that positive floating point numbers can be correctly compared as integers, at least on those machines where 1.0F is 0x3f800000.

C: Adding Exponentials

What I thought was a trivial addition in standard C code compiled by GCC has confused me somewhat.
If I have a double called A and also a double called B, where A is a very small value, say 1e-20, and B is a larger value, for example 1e-5, why does my double C, which equals the summation A+B, take on only the dominant value B? I was hoping that when I specify printing to 25 decimal places I would get 1.00000000000000100000e-5.
Instead what I get is just 1.00000000000000000000e-5. Do I have to use long double or something else?
Very confused, and an easy question for most to answer I'm sure! Thanks for any guidance in advance.
Yes, there is not enough precision in the double mantissa. 2^53 (the precision of the double mantissa) is only slightly larger than 10^15 (the ratio between 1e-5 and 1e-20), so binary expansion and round-off can easily squash small bits at the end.
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
Google is your friend etc.
Floating-point variables can hold a bigger range of values than fixed point, but their precision in significant digits is limited.
You can represent very big or very small numbers, but the precision depends on the number of significant digits.
If you try to perform an operation between numbers that are very far apart in exponent, the ability to work with them depends on being able to represent them with a common exponent.
In your case, when you try to sum the two numbers, the smaller number is rescaled to match the exponent of the bigger one; its significant digits then fall below the precision of the result, so its contribution is essentially lost in rounding.
You can learn more, for example, on Wikipedia.
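To see the absorption effect in isolation, here is a sketch with round numbers rather than the exact values from the question (whether a small addend vanishes completely depends on how far it sits below one ulp of the larger operand):
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("1.0 + 1e-20 == 1.0 : %d\n", 1.0 + 1e-20 == 1.0);  /* 1: 1e-20 is below half an ulp of 1.0 */
    printf("1.0 + 1e-15 == 1.0 : %d\n", 1.0 + 1e-15 == 1.0);  /* 0: 1e-15 is above the ulp (~2.2e-16) */
    printf("ulp of 1.0 (DBL_EPSILON) = %g\n", DBL_EPSILON);
    return 0;
}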
