I have implemented a 32-bit IEEE-754 Floating Point Square Root using the Newton-Raphson method (in assembly) based upon finding the reciprocal of the square root.
I am using the round-to-nearest rounding method.
My square root method only accepts normalized values and zeros, but no denormalized values or special values (NaN, Inf, etc.).
I am wondering how I can ACHIEVE correct rounding (with assembly-like instructions) so that my results are correct (per IEEE-754) for all inputs?
Basically, I know how to test if my results are correct, but I want to adjust the algorithm below so that I obtain correctly rounded results. What instructions should I add to the algorithm?
See: Determining Floating Point Square Root for more information.
Thank you!
There are only about 2 billion floats matching your description. Try them all, compare against sqrtf from your C library, and examine all differences. You can get a higher-precision square root using sqrt or sqrtl from your C library if you are worried. sqrt, sqrtf, and sqrtl are correctly-rounded by typical C libraries, though, so a direct comparison ought to work.
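For instance, a brute-force harness along these lines would do it (a sketch only: your_sqrt stands in for your routine, and the code assumes 32-bit IEEE-754 floats reinterpreted via memcpy):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

float your_sqrt(float x);   /* placeholder for the routine under test */

int main(void) {
    unsigned long mismatches = 0;
    /* 0x00000000 .. 0x7F7FFFFF covers +0 and every positive finite float */
    for (uint32_t bits = 0; bits <= 0x7F7FFFFFu; ++bits) {
        float x;
        memcpy(&x, &bits, sizeof x);           /* reinterpret the bit pattern */
        if (x != 0.0f && !isnormal(x))         /* skip denormals, keep +0 */
            continue;
        float expected = sqrtf(x);             /* correctly rounded on typical libraries */
        float got = your_sqrt(x);
        if (memcmp(&expected, &got, sizeof got) != 0) {
            ++mismatches;
            printf("x=%a expected=%a got=%a\n", x, expected, got);
        }
    }
    printf("%lu mismatches\n", mismatches);
    return 0;
}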
Why not square the result, and if it's not equal to the input, add or subtract (depending on the sign of the difference) a least significant bit, square, and check whether that would have given a better result?
Better here could mean with less absolute difference. The only case where this could get tricky is when "crossing" √2 with the mantissa, but this could be checked once and for all.
EDIT
I realize that the above answer is insufficient. Simply squaring in 32-bit FP and comparing to the input doesn't give you enough information. Let's say y = your_sqrt(x). You compare y² to x, find that y² > x, subtract 1 LSB from y obtaining z (y1 in your comments), then compare z² to x and find that not only z² < x, but, within the available bits, y² - x == x - z² - how do you choose between y and z? You should either work with all the bits (I guess this is what you were looking for), or at least with more bits (which I guess is what njuffa is suggesting).
From a comment of yours I suspect you are on strictly 32-bit hardware, but let me suppose that you have a 32-bit by 32-bit integer multiplication with a 64-bit result available (if not, it can be constructed). If you take the 23 bits of the mantissa of y as an integer, put a 1 in front, and multiply it by itself, you have a number that, except for a possible extra shift by 1, you can directly compare to the mantissa of x treated the same way. This way you have all 48 bits available for the comparison, and can decide without any approximation whether abs(y² - x) ≷ abs(z² - x).
If you are not sure you are within one LSB of the final result (but you are sure you are not much farther than that), you should repeat the above until y² - x changes sign or hits 0. Watch out for edge cases, though, which are essentially the cases where the exponent is adjusted because the mantissa crosses a power of 2.
It can also be helpful to remember that positive floating point numbers can be correctly compared as integers, at least on those machines where 1.0F is 0x3f800000.
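To make the mantissa-squaring comparison above concrete, here is one possible shape of the final correction step (a sketch, not the poster's code: pick_nearer is a made-up name, and it assumes x is positive and normalized, y is within one ULP of √x, and the power-of-2-crossing edge case mentioned above is not handled):

#include <stdint.h>
#include <string.h>

static float pick_nearer(float x, float y)
{
    uint32_t xb, yb;
    memcpy(&xb, &x, sizeof xb);
    memcpy(&yb, &y, sizeof yb);

    uint64_t Mx = (xb & 0x7FFFFFu) | 0x800000u;   /* 24-bit significands (implicit 1 restored) */
    uint64_t My = (yb & 0x7FFFFFu) | 0x800000u;
    int ex = (int)((xb >> 23) & 0xFF) - 127;      /* unbiased exponents */
    int ey = (int)((yb >> 23) & 0xFF) - 127;

    /* y*y = My*My * 2^(2*ey-46) and x = Mx * 2^(ex-23); for y near sqrt(x)
       the shift is 23 or 24 (the "possible extra shift by 1"), so X fits in 64 bits. */
    int shift = (ex - 23) - (2 * ey - 46);
    uint64_t X  = Mx << shift;
    uint64_t Y2 = My * My;                        /* full 48-bit square */

    /* The other candidate: one ULP below y if y*y overshoots x, else one ULP above.
       (Not handled: the neighbour crossing a power of 2, where the exponent changes.) */
    uint32_t zb = (Y2 > X) ? yb - 1 : yb + 1;
    uint64_t Mz = (zb & 0x7FFFFFu) | 0x800000u;
    uint64_t Z2 = Mz * Mz;

    uint64_t dy = (Y2 > X) ? Y2 - X : X - Y2;
    uint64_t dz = (Z2 > X) ? Z2 - X : X - Z2;

    float z;
    memcpy(&z, &zb, sizeof z);
    return (dy <= dz) ? y : z;                    /* keep whichever squares closer to x */
}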
I am aware that real numbers cannot be represented exactly in binary (even though with so-called double precision) in most cases. For example, 1.0/3.0 is approximated by 0x3fd5555555555555, which actually represents 0.33333333333333331483.... If we perform (1.0/3.0)+(1.0/3.0) then we obtain 0x3fe5555555555555 (so 0.66666666666666662965...), just as expected in a sense of computer arithmetic.
However, when I tried to perform (1.0/3.0)+(1.0/3.0)+(1.0/3.0) by writing the following code
#include <stdio.h>
#include <stdint.h>
#include <string.h>
int main(void){
    double result = 1.0/3.0;
    result += 1.0/3.0;
    result += 1.0/3.0;
    uint64_t bits;
    memcpy(&bits, &result, sizeof bits);   /* print the bit pattern, not the value */
    printf("%016llx\n", (unsigned long long)bits);
    return 0;
}
and compiling it with the GNU C compiler, the resulting program printed 0x3ff0000000000000 (which represents exactly 1). This result confused me, because I initially expected 0x3fefffffffffffff (I did not expect the rounding errors to cancel each other out, since both (1.0/3.0) and ((1.0/3.0)+(1.0/3.0)) are smaller than the true values when represented in binary), and I still have not figured out what happened.
I would be grateful if you let me know possible reasons for this result.
There is no need to consider 80 bit representation - the results are the same in Java which requires, except for some irrelevant edge cases, the same behavior as IEEE 754 64-bit binary arithmetic for its doubles.
The exact value of 1.0/3.0 is 0.333333333333333314829616256247390992939472198486328125
As long as all numbers involved are in the normal range, multiplying or dividing by a power of two is exact. It only changes the exponent, not the significand. In particular, adding 1.0/3.0 to itself is exact, so the result of the first addition is 0.66666666666666662965923251249478198587894439697265625
The second addition does involve rounding. The exact sum is 0.999999999999999944488848768742172978818416595458984375, which is bracketed by the representable numbers 0.99999999999999988897769753748434595763683319091796875 and 1.0. The exact value is half way between the bracketing numbers, so a single bit has to be dropped. The least significant bit of 1.0's significand is a zero, so round-to-nearest-even picks 1.0 as the result of the addition.
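One way to watch this tie being broken is to print the intermediate values in C99 hex-float notation, which shows the exact bits quoted in decimal above (a minimal sketch):

#include <stdio.h>

int main(void) {
    double third = 1.0 / 3.0;
    double two_thirds = third + third;       /* exact: only the exponent changes */
    double sum = two_thirds + third;         /* the rounded addition discussed above */

    printf("1/3        = %a\n", third);
    printf("2/3        = %a\n", two_thirds);
    printf("sum        = %a\n", sum);
    printf("just below = %a\n", 1.0 - 0x1p-53);   /* the other bracketing number */
    return 0;
}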
That is a good rounding question. If I remember correctly, the arithmetic coprocessor uses 80 bits: 64 precision bits and 15 for the exponent (ref.). That means that internally the operation uses more bits than you can display, and in the end the coprocessor rounds its (more accurate) internal representation to a 64-bit value. And as the first bit dropped is a 1 and not a 0, the result is rounded upward, giving 1.
But I must admit I am just guessing here...
But if you try to do the operation by hand, it immediately becomes clear that the addition sets all precision bits to 1 (adding 5555...5 and 555...5 shifted by 1), plus the first bit to be dropped, which is also 1. So by hand a normal human being would also round upward, giving 1, so it is no surprise that the arithmetic unit is able to do the correct rounding.
There are many questions (and answers) on this subject, but I am too thick to figure it out. In C, for a floating point of a given type, say double:
double x;
scanf("%lf", &x);
Is there a generic way to calculate an upper bound (as small as possible) for the error between the decimal fraction string passed to scanf and the internal representation of what is now in x?
If I understand correctly, there is sometimes going to be an error, and it will increase as the absolute value of the decimal fraction increases (in other words, 0.1 will be a bit off, but 100000000.1 will be off by much more).
This aspect of the C standard is slightly under-specified, but you can expect the conversion from decimal to double to be within one Unit in the Last Place of the original.
You seem to be looking for a bound on the absolute error of the conversion. With the above assumption, you can compute such a bound, as a double, as DBL_EPSILON * fabs(x). DBL_EPSILON is typically 2^-52.
A tighter bound on the error that can have been made during the conversion can be computed as follows:
double va = fabs(x);
double error = nextafter(va, INFINITY) - va;   /* one ULP above |x|; needs <math.h> */
The best conversion functions guarantee conversion to half a ULP in default round-to-nearest mode. If you are using conversion functions with this guarantee, you can divide the bound I offer by two.
The above applies when the original number represented in decimal is 0 or when its absolute value is between DBL_MIN (approx. 2.2*10^-308) and DBL_MAX (approx. 1.8*10^308). If the non-zero decimal number's absolute value is lower than DBL_MIN, then the absolute error is instead bounded by DBL_MIN * DBL_EPSILON. If the absolute value is higher than DBL_MAX, you are likely to get infinity as the result of the conversion.
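Putting the pieces together, the bound might be packaged like this (a sketch under the 1-ULP assumption above; conversion_error_bound is a made-up helper name):

#include <float.h>
#include <math.h>

/* Upper bound on |converted - original| for a decimal string converted to x,
   assuming the conversion itself is accurate to within one ULP. */
double conversion_error_bound(double x)
{
    double va = fabs(x);
    if (va < DBL_MIN)                       /* zero or subnormal result */
        return DBL_MIN * DBL_EPSILON;       /* spacing of the subnormals */
    return nextafter(va, INFINITY) - va;    /* one ULP of |x| */
}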
You can't think of this in terms of base 10; the error is in base 2, which won't necessarily correspond to a specific decimal place in base 10.
There are two underlying issues in your question. The first is scanf taking an ASCII string and converting it to a binary number; that is one piece of software which uses a number of C libraries. I have seen, for example, compile-time parsing and runtime parsing give different conversion results on the same system. So in terms of error, if you want an exact number, convert it yourself and place that binary number in the register/variable; otherwise accept what you get with the conversion and understand there may be rounding or clipping in the conversion that you didn't expect (which results in an accuracy issue: you didn't get the number you expected).
The second and real problem Pascal already answered: you only have a fixed number of binary places. In decimal terms, if you had 3 decimal places the number 1.2345 would have to be represented as either 1.234 or 1.235. The same goes for binary: if you have 3 fractional bits of mantissa then 1.0011 is either 1.001 or 1.010 depending on rounding. The mantissa length for IEEE floating-point numbers is well documented; you can simply look up how many binary places you have for each precision.
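For reference, those widths are available directly from <float.h>, so you don't even have to look them up (a trivial sketch):

#include <float.h>
#include <stdio.h>

int main(void) {
    printf("float significand bits:       %d\n", FLT_MANT_DIG);   /* typically 24 */
    printf("double significand bits:      %d\n", DBL_MANT_DIG);   /* typically 53 */
    printf("long double significand bits: %d\n", LDBL_MANT_DIG);  /* platform-dependent */
    return 0;
}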
What I thought was a trivial addition in standard C code compiled by GCC has confused me somewhat.
If I have a double called A and also a double called B, and A is a very small value, say 1e-20, and B is a larger value, for example 1e-5 - why does my double C, which equals the summation A+B, take on the dominant value B? I was hoping that when I specify printing to 25 decimal places I would get 1.00000000000000100000e-5.
Instead what I get is just 1.00000000000000000000e-5. Do I have to use long double or something else?
Very confused, and an easy question for most to answer I'm sure! Thanks for any guidance in advance.
Yes, there is not enough precision in the double mantissa. 2^53 (the precision of the double mantissa) is about 9*10^15, not much larger than 10^15 (the ratio between 1e-5 and 1e-20), so binary expansion and round-off can easily squash the small bits at the end.
http://en.wikipedia.org/wiki/Double-precision_floating-point_format
Google is your friend etc.
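If you want to see exactly which bits survive the addition, printing the values in C99 hex-float notation shows the significands directly (a small sketch; the output depends on your platform's double format):

#include <stdio.h>

int main(void) {
    double a = 1e-20;
    double b = 1e-5;
    double c = a + b;

    printf("a     = %a\n", a);
    printf("b     = %a\n", b);
    printf("a + b = %a\n", c);    /* compare the trailing significand bits of b and a+b */
    return 0;
}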
Floating-point variables can hold a bigger range of values than fixed-point ones, but their precision is limited to a fixed number of significant digits.
You can represent very big or very small numbers, but the precision depends on the number of significant digits available.
If you try to perform an operation between numbers whose exponents are far apart, the operands first have to be brought to the same exponent.
In your case, when you try to sum the two numbers, the smaller number is matched in exponent with the bigger one, and its significant digits are shifted out of range, so its contribution is essentially 0.
You can learn more, for example, on wiki.
If I want to check that positive float A is less than the inverse square of another positive float B (in C99), could something go wrong if B is very small?
I could imagine checking it like
if(A<1/(B*B))
But if B is small enough, would this possibly result in infinity? If that were to happen, would the code still work correctly in all situations?
In a similar vein, I might do
if(1/A>B*B)
... which might be slightly better because B*B might be zero if B is small (is this true?)
Finally, a solution that I can't imagine being wrong is
if(sqrt(1/A)>B)
which I don't think would ever result in zero division, but still might be problematic if A is close to zero.
So basically, my questions are:
Can 1/X ever be infinity if X is greater than zero (but small)?
Can X*X ever be zero if X is greater than zero?
Will comparisons with infinity work the way I would expect them to?
EDIT: for those of you who are wondering, I ended up doing
if(B*A*B<1)
I did it in that order as it is visually unambiguous which multiplication occurs first.
If you want to handle the entire range of possible values of A and B, then you need to be a little bit careful, but this really isn't too complicated.
The suggestion of using a*b*b < 1. is a good one; if b is so tiny that a*b*b underflows to zero, then a is necessarily smaller than 1./(b*b). Conversely, if b is so large that a*b*b overflows to infinity, then the condition will (correctly) not be satisfied. (Potatoswatter correctly points out in a comment on another post that this does not work properly if you write it b*b*a, because b*b might overflow to infinity even when the condition should be true, if a happens to be denormal. However, in C, multiplication associates left-to-right, so that is not an issue if you write it a*b*b and your platform adheres to a reasonable numerics model.)
Because you know a priori that a and b are both positive numbers, there is no way for a*b*b to generate a NaN, so you needn't worry about that condition. Overflow and underflow are the only possible misbehaviors, and we have accounted for them already. If you needed to support the case where a or b might be zero or infinity, then you would need to be somewhat more careful.
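As a concrete illustration of that ordering (a sketch only; the function name is made up, and it assumes a and b are positive finite values as described above):

#include <stdbool.h>

/* Is a < 1/(b*b)?  C evaluates a*b*b left to right as (a*b)*b, so b*b is
   never formed on its own: if a*b*b underflows to 0 the answer is yes,
   and if it overflows to infinity the answer is no. */
static bool less_than_inverse_square(double a, double b)
{
    return a * b * b < 1.0;
}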
To answer your direct questions: (answers assume IEEE-754 arithmetic)
Can 1/X ever be infinity if X is greater than zero (but small)?
Yes! If x is a small positive denormal value, then 1/x can overflow and produce infinity. For example, in double precision in the default rounding mode, 1 / 0x1.0p-1024 will overflow.
Can X*X ever be zero if X is greater than zero?
Yes! In double precision in the default rounding mode, all values of x smaller than 0x1.0p-538 (that's 2^-538 in the C99 hex format) or so have this property.
Will comparisons with infinity work the way I would expect them to?
Yes! This is one of the best features of IEEE-754.
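Both effects are easy to observe directly (a small sketch; the constants assume IEEE-754 double precision and the default rounding mode):

#include <math.h>
#include <stdio.h>

int main(void) {
    double tiny  = 0x1.0p-1024;    /* positive denormal */
    double small = 0x1.0p-600;     /* positive, well above the denormal range */

    printf("1/tiny          = %a (isinf: %d)\n", 1.0 / tiny, isinf(1.0 / tiny));
    printf("small*small     = %a\n", small * small);    /* underflows to 0 */
    printf("1.0 < INFINITY  = %d\n", 1.0 < INFINITY);   /* comparisons behave as expected */
    return 0;
}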
OK, reposting as an answer.
Try using an arithmetically equivalent comparison like if ( A*B*B < 1. ). You might get in trouble with really big numbers, though.
Take a careful look at the IEEE 754 for your corner cases.
You want to avoid divisions so the trick is to modify the equation. You can multiply both sides of your first equation by (b*b) to get:
b*b*a < 1.0
This won't have any divisions so should be ok.
Division per se isn't so bad. However, standard IEEE 754 FP types allow for a greater negative range of exponents than positive, due to denormalized numbers. For example, float ranges from about 1.4×10^-45 to 3.4×10^38, so you cannot take the inverse of 2×10^-44.
Therefore, as Jeremy suggests, start by multiplying A by B, where one has a positive exponent and the other has a negative exponent, to avoid overflow.
This is why A*B*B<1 is the proper answer.
The question I meant to ask concerned the mantissa, not the exponent, and has lots to do with the question I asked earlier in the week regarding "missing" digits on the sum of two negative floats.
Given that the mantissa has a variable precision, how does one tell if one has overflowed the mantissa's current precision setting? Or, from the proactive side, how can one tell if mantissa precision overflow is likely?
Kind regards,
Bruce.
There are a few numerical methods to see if you're going to lose precision, but the bottom line is that you need to understand the definition of precision better.
-4939600281397002.2812
and
-4939600281397002.2812000000000000
are NOT the same number.
When you add
-2234.6016114467412141
and
-4939600281397002.2812
together, the correct output will only have 20 digits of precision, because the additional 12 digits in the smaller number are meaningless given that the 12 similarly sized digits in the larger number are unknown. You can imply that they are zero, but if that's the case then you must explicitly declare them to be as such, and use a numbering system that can handle it - the computer is not good at understanding implicit intentions.
As far as detecting when you are going to have this problem, all you need to do is compare the exponents of the two operands (assuming normalized mantissas of magnitude around 1, or the binary equivalent): if the exponents differ by more than the number of digits in the mantissa, the smaller operand's digits are lost entirely. If the numbers aren't normalized, then you'll need to normalize them to compare, or use a slightly more complex comparison involving the exponent.
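For a binary floating-point type, that exponent check can be written portably with frexp (a sketch; the helper name is made up, and it assumes nonzero operands of the same sign):

#include <float.h>
#include <math.h>
#include <stdbool.h>

/* True if adding the smaller-magnitude operand to the larger one cannot
   change the larger one at all, because their exponents are too far apart. */
static bool addend_is_lost(double big, double small)
{
    int eb, es;
    (void)frexp(big, &eb);     /* we only need the binary exponents */
    (void)frexp(small, &es);
    return (eb - es) > DBL_MANT_DIG;
}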
Precision and accuracy are not the same thing...
-Adam