Efficient floating-point division with constant integer divisors

Efficient floating-point division with constant integer divisors - c

A recent question, whether compilers are allowed to replace floating-point division with floating-point multiplication, inspired me to ask this question.
Under the stringent requirement, that the results after code transformation shall be bit-wise identical to the actual division operation,
it is trivial to see that for binary IEEE-754 arithmetic, this is possible for divisors that are a power of two. As long as the reciprocal
of the divisor is representable, multiplying by the reciprocal of the divisor delivers results identical to the division. For example, multiplication by 0.5 can replace division by 2.0.
One then wonders for what other divisors such replacements work, assuming we allow any short instruction sequence that replaces division but runs significantly faster, while delivering bit-identical results. In particular allow fused multiply-add operations in addition to plain multiplication.
In comments I pointed to the following relevant paper:
Nicolas Brisebarre, Jean-Michel Muller, and Saurabh Kumar Raina. Accelerating correctly rounded floating-point division when the divisor is known in advance. IEEE Transactions on Computers, Vol. 53, No. 8, August 2004, pp. 1069-1072.
The technique advocated by the authors of the paper precomputes the reciprocal of the divisor y as a normalized head-tail pair zh:zl as follows: zh = 1 / y, zl = fma (-y, zh, 1) / y. Later, the division q = x / y is then computed as q = fma (zh, x, zl * x). The paper derives various conditions that divisor y must satisfy for this algorithm to work. As one readily observes, this algorithm has problems with infinities and zero when the signs of head and tail differ. More importantly, it will fail to deliver correct results for dividends x that are very small in magnitude, because computation of the quotient tail, zl * x, suffers from underflow.
The paper also makes a passing reference to an alternative FMA-based division algorithm, pioneered by Peter Markstein when he was at IBM. The relevant reference is:
P. W. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research & Development, Vol. 34, No. 1, January 1990, pp. 111-119
In Markstein's algorithm, one first computes a reciprocal rc, from which an initial quotient q = x * rc is formed. Then, the remainder of the division is computed accurately with an FMA as r = fma (-y, q, x), and an improved, more accurate quotient is finally computed as q = fma (r, rc, q).
This algorithm also has issues for x that are zeroes or infinities (easily worked around with appropriate conditional execution), but exhaustive testing using IEEE-754 single-precision float data shows that it delivers the correct quotient across all possibe dividends x for many divisors y, among these many small integers. This C code implements it:
/* precompute reciprocal */
rc = 1.0f / y;
/* compute quotient q=x/y */
q = x * rc;
if ((x != 0) && (!isinf(x))) {
r = fmaf (-y, q, x);
q = fmaf (r, rc, q);
}
On most processor architectures, this should translate into a branchless sequence of instructions, using either predication, conditional moves, or select-type instructions. To give a concrete example: For division by 3.0f, the nvcc compiler of CUDA 7.5 generates the following machine code for a Kepler-class GPU:
LDG.E R5, [R2]; // load x
FSETP.NEU.AND P0, PT, |R5|, +INF , PT; // pred0 = fabsf(x) != INF
FMUL32I R2, R5, 0.3333333432674408; // q = x * (1.0f/3.0f)
FSETP.NEU.AND P0, PT, R5, RZ, P0; // pred0 = (x != 0.0f) && (fabsf(x) != INF)
FMA R5, R2, -3, R5; // r = fmaf (q, -3.0f, x);
MOV R4, R2 // q
#P0 FFMA R4, R5, c[0x2][0x0], R2; // if (pred0) q = fmaf (r, (1.0f/3.0f), q)
ST.E [R6], R4; // store q
For my experiments, I wrote the tiny C test program shown below that steps through integer divisors in increasing order and for each of them exhaustively tests the above code sequence against the proper division. It prints a list of the divisors that passed this exhaustive test. Partial output looks as follows:
PASS: 1, 2, 3, 4, 5, 7, 8, 9, 11, 13, 15, 16, 17, 19, 21, 23, 25, 27, 29, 31, 32, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 64, 65, 67, 69,
To incorporate the replacement algorithm into a compiler as an optimization, a whitelist of divisors to which the above code transformation can safely be applied is impractical. The output of the program so far (at a rate of about one result per minute) suggests that the fast code works correctly across all possible encodings of x for those divisors y that are odd integers or are powers of two. Anecdotal evidence, not a proof, of course.
What set of mathematical conditions can determine a-priori whether the transformation of division into the above code sequence is safe? Answers can assume that all the floating-point operations are performed in the default rounding mode of "round to nearest or even".
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
int main (void)
{
float r, q, x, y, rc;
volatile union {
float f;
unsigned int i;
} arg, res, ref;
int err;
y = 1.0f;
printf ("PASS: ");
while (1) {
/* precompute reciprocal */
rc = 1.0f / y;
arg.i = 0x80000000;
err = 0;
do {
/* do the division, fast */
x = arg.f;
q = x * rc;
if ((x != 0) && (!isinf(x))) {
r = fmaf (-y, q, x);
q = fmaf (r, rc, q);
}
res.f = q;
/* compute the reference, slowly */
ref.f = x / y;
if (res.i != ref.i) {
err = 1;
break;
}
arg.i--;
} while (arg.i != 0x80000000);
if (!err) printf ("%g, ", y);
y += 1.0f;
}
return EXIT_SUCCESS;
}

Let me restart for the third time. We are trying to accelerate
q = x / y
where y is an integer constant, and q, x, and y are all IEEE 754-2008 binary32 floating-point values. Below, fmaf(a,b,c) indicates a fused multiply-add a * b + c using binary32 values.
The naive algorithm is via a precalculated reciprocal,
C = 1.0f / y
so that at runtime a (much faster) multiplication suffices:
q = x * C
The Brisebarre-Muller-Raina acceleration uses two precalculated constants,
zh = 1.0f / y
zl = -fmaf(zh, y, -1.0f) / y
so that at runtime, one multiplication and one fused multiply-add suffices:
q = fmaf(x, zh, x * zl)
The Markstein algorithm combines the naive approach with two fused multiply-adds that yields the correct result if the naive approach yields a result within 1 unit in the least significant place, by precalculating
C1 = 1.0f / y
C2 = -y
so that the divison can be approximated using
t1 = x * C1
t2 = fmaf(C1, t1, x)
q = fmaf(C2, t2, t1)
The naive approach works for all powers of two y, but otherwise it is pretty bad. For example, for divisors 7, 14, 15, 28, and 30, it yields an incorrect result for more than half of all possible x.
The Brisebarre-Muller-Raina approach similarly fails for almost all non-power of two y, but much fewer x yield the incorrect result (less than half a percent of all possible x, varies depending on y).
The Brisebarre-Muller-Raina article shows that the maximum error in the naive approach is ±1.5 ULPs.
The Markstein approach yields correct results for powers of two y, and also for odd integer y. (I have not found a failing odd integer divisor for the Markstein approach.)
For the Markstein approach, I have analysed divisors 1 - 19700 (raw data here).
Plotting the number of failure cases (divisor in the horizontal axis, the number of values of x where Markstein approach fails for said divisor), we can see a simple pattern occur:
(source: nominal-animal.net)
Note that these plots have both horizontal and vertical axes logarithmic. There are no dots for odd divisors, as the approach yields correct results for all odd divisors I've tested.
If we change the x axis to the bit reverse (binary digits in reverse order, i.e. 0b11101101 → 0b10110111, data) of the divisors, we have a very clear pattern:
(source: nominal-animal.net)
If we draw a straight line through the center of the point sets, we get curve 4194304/x. (Remember, the plot considers only half the possible floats, so when considering all possible floats, double it.)
8388608/x and 2097152/x bracket the entire error pattern completely.
Thus, if we use rev(y) to compute the bit reverse of divisor y, then 8388608/rev(y) is a good first order approximation of the number of cases (out of all possible float) where the Markstein approach yields an incorrect result for an even, non-power-of-two divisor y. (Or, 16777216/rev(x) for the upper limit.)
Added 2016-02-28: I found an approximation for the number of error cases using the Markstein approach, given any integer (binary32) divisor. Here it is as pseudocode:
function markstein_failure_estimate(divisor):
if (divisor is zero)
return no estimate
if (divisor is not an integer)
return no estimate
if (divisor is negative)
negate divisor
# Consider, for avoiding underflow cases,
if (divisor is very large, say 1e+30 or larger)
return no estimate - do as division
while (divisor > 16777216)
divisor = divisor / 2
if (divisor is a power of two)
return 0
if (divisor is odd)
return 0
while (divisor is not odd)
divisor = divisor / 2
# Use return (1 + 83833608 / divisor) / 2
# if only nonnegative finite float divisors are counted!
return 1 + 8388608 / divisor
This yields a correct error estimate to within ±1 on the Markstein failure cases I have tested (but I have not yet adequately tested divisors larger than 8388608). The final division should be such that it reports no false zeroes, but I cannot guarantee it (yet). It does not take into account very large divisors (say 0x1p100, or 1e+30, and larger in magnitude) which have underflow issues -- I would definitely exclude such divisors from acceleration anyway.
In preliminary testing, the estimate seems uncannily accurate. I did not draw a plot comparing the estimates and the actual errors for divisors 1 to 20000, because the points all coincide exactly in the plots. (Within this range, the estimate is exact, or one too large.) Essentially, the estimates reproduce the first plot in this answer exactly.
The pattern of failures for the Markstein approach is regular, and very interesting. The approach works for all power of two divisors, and all odd integer divisors.
For divisors greater than 16777216, I consistently see the same errors as for a divisor that is divided by the smallest power of two to yield a value less than 16777216. For example, 0x1.3cdfa4p+23 and 0x1.3cdfa4p+41, 0x1.d8874p+23 and 0x1.d8874p+32, 0x1.cf84f8p+23 and 0x1.cf84f8p+34, 0x1.e4a7fp+23 and 0x1.e4a7fp+37. (Within each pair, the mantissa is the same, and only the power of two varies.)
Assuming my test bench is not in error, this means that the Markstein approach also works divisors larger than 16777216 in magnitude (but smaller than, say, 1e+30), if the divisor is such that when divided by the smallest power of two that yields a quotient of less than 16777216 in magnitude, and the quotient is odd.

This question asks for a way to identify the values of the constant Y that make it safe to transform x / Y into a cheaper computation using FMA for all possible values of x. Another approach is to use static analysis to determine an over-approximation of the values x can take, so that the generally unsound transformation can be applied in the knowledge that the values for which the transformed code differs from the original division do not happen.
Using representations of sets of floating-point values that are well adapted to the problems of floating-point computations, even a forwards analysis starting from the beginning of the function can produce useful information. For instance:
float f(float z) {
float x = 1.0f + z;
float r = x / Y;
return r;
}
Assuming the default round-to-nearest mode(*), in the above function x can only be NaN (if the input is NaN), +0.0f, or a number larger than 2-24 in magnitude, but not -0.0f or anything closer to zero than 2-24. This justifies the transformation into one of the two forms shown in the question for many values of the constant Y.
(*) assumption without which many optimizations are impossible and that C compilers already make unless the program explicitly uses #pragma STDC FENV_ACCESS ON
A forwards static analysis that predicts the information for x above can be based on a representation of sets of floating-point values an expression can take as a tuple of:
a representation for the sets of possible NaN values (Since behaviors of NaN are underspecified, a choice is to use only a boolean, with true meaning some NaNs can be present, and false indicating no NaN is present.),
four boolean flags indicating respectively the presence of +inf, -inf, +0.0, -0.0,
an inclusive interval of negative finite floating-point values, and
an inclusive interval of positive finite floating-point values.
In order to follow this approach, all the floating-point operations that can occur in a C program must be understood by the static analyzer. To illustrate, the addition betweens sets of values U and V, to be used to handle + in the analyzed code, can be implemented as:
If NaN is present in one of the operands, or if the operands can be infinities of opposite signs, NaN is present in the result.
If 0 cannot be a result of the addition of a value of U and a value of V, use standard interval arithmetic. The upper bound of the result is obtained for the round-to-nearest addition of the largest value in U and the largest value in V, so these bounds should be computed with round-to-nearest.
If 0 can be a result of the addition of a positive value of U and a negative value of V, then let M be the smallest positive value in U such that -M is present in V.
if succ(M) is present in U, then this pair of values contributes succ(M) - M to the positive values of the result.
if -succ(M) is present in V, then this pair of values contributes the negative value M - succ(M) to the negative values of the result.
if pred(M) is present in U, then this pair of values contributes the negative value pred(M) - M to the negative values of the result.
if -pred(M) is present in V, then this pair of values contributes the value M - pred(M) to the positive values of the result.
Do the same work if 0 can be the result of the addition of a negative value of U and a positive value of V.
Acknowledgement: the above borrows ideas from “Improving the Floating Point Addition and Subtraction Constraints”, Bruno Marre & Claude Michel
Example: compilation of the function f below:
float f(float z, float t) {
float x = 1.0f + z;
if (x + t == 0.0f) {
float r = x / 6.0f;
return r;
}
return 0.0f;
}
The approach in the question refuses to transform the division in function f into an alternate form, because 6 is not one of the value for which the division can be unconditionally transformed. Instead, what I am suggesting is to apply a simple value analysis starting from the beginning of the function which, in this case, determines that x is a finite float either +0.0f or at least 2-24 in magnitude, and to use this information to apply Brisebarre et al's transformation, confident in the knowledge that x * C2 does not underflow.
To be explicit, I am suggesting to use an algorithm such as the one below to decide whether or not to transform the division into something simpler:
Is Y one of the values that can be transformed using Brisebarre et al's method according to their algorithm?
Do C1 and C2 from their method have the same sign, or is it possible to exclude the possibility that the dividend is infinite?
Do C1 and C2 from their method have the same sign, or can x take only one of the two representations of 0? If in the case where C1 and C2 have different signs and x can only be one representation of zero, remember to fiddle(**) with the signs of the FMA-based computation to make it produce the correct zero when x is zero.
Can the magnitude of the dividend be guaranteed to be large enough to exclude the possibility that x * C2 underflows?
If the answer to the four questions is “yes”, then the division can be transformed into a multiplication and an FMA in the context of the function being compiled. The static analysis described above serves to answer questions 2., 3. and 4.
(**) “fiddling with the signs” means using -FMA(-C1, x, (-C2)*x) in place of FMA(C1, x, C2*x) when this is necessary to make the result come out correctly when x can only be one of the two signed zeroes

I love #Pascal's answer but in optimization it's often better to have a simple and well-understood subset of transformations rather than a perfect solution.
All current and common historical floating point formats had one thing in common: a binary mantissa.
Therefore, all fractions were rational numbers of the form:
x / 2n
This is in contrast to the constants in the program (and all possible base-10 fractions) which are rational numbers of the form:
x / (2n * 5m)
So, one optimization would simply test the input and reciprocal for m == 0, since those numbers are represented exactly in the FP format and operations with them should produce numbers that are accurate within the format.
So, for example, within the (decimal 2-digit) range of .01 to 0.99 dividing or multiplying by the following numbers would be optimized:
.25 .50 .75
And everything else would not. (I think, do test it first, lol.)

The result of a floating point division is:
a sign flag
a significand
an exponent
a set of flags (overflow, underflow, inexact, etc - see fenv())
Getting the first 3 pieces correct (but the set of flags incorrect) is not enough. Without further knowledge (e.g. which parts of which pieces of the result actually matter, the possible values of the dividend, etc) I would assume that replacing division by a constant with multiplication by a constant (and/or a convoluted FMA mess) is almost never safe.
In addition; for modern CPUs I also wouldn't assume that replacing a division with 2 FMAs is always an improvement. For example, if the bottleneck is instruction fetch/decode, then this "optimisation" would make performance worse. For another example, if subsequent instructions don't depend on the result (the CPU can do many other instructions in parallel while waiting for the result) the FMA version may introduce multiple dependency stalls and make performance worse. For a third example, if all registers are being used then the FMA version (which requires additional "live" variables) may increase "spilling" and make performance worse.
Note that (in many but not all cases) division or multiplication by a constant multiple of 2 can be done with addition alone (specifically, adding a shift count to the exponent).

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>
double feval(double x) {
/* Insert your code here */
if (x > 1e299)
{;
return x-1;
}
if (x < 1e-6)
{
long double g;
g = x;
printf("x = %Lf\n", g);
long double a;
a = pow(x,2);
printf("x squared = %Lf\n", a);
return sqrt(g*g+1.)- 1.;
}
else
{
printf("x = %f\n", x);
printf("Used third \n");
return sqrt(pow(x,2)+1.)-1;
}
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}

For small inputs, you're getting truncation error when you do 1+x^2. If x=1e-7f, x*x will happily fit into a 32 bit floating point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating point representation, but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.

There are two issues here when implementing this in the naive way: Overflow or underflow in intermediate computation when computing x * x, and substractive cancellation during final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
return (x / (1.0f + hypotf (x, 1.0f))) * x;
}

A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1)
/(sqrt(x*x+1)+1)
= (x*x+1-1) / (sqrt(x*x+1)+1)
= x*x/(sqrt(x*x+1)+1)
The last formula can be used as an implementation. For vwry small x sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be 2) but we don;t loose precision in evaluating it.

The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

Float data type uncertainty

I am doing a numerical analysis of a math software I developed. I want to identify what is the uncertainty of my result. Being f() my method and x an input value, I want to identify y of my result as f(x) +/- y. My f() method has multiple operations between float variables. To study the error propagation occurred in f(), I have to apply the Statistical Propagation of Uncertainty formulas and in order to do so I have to know the uncertainty of a float variable.
I do understand the architecture of a float variable as specified in the IEEE 754 standard and the rounding error converting a decimal value to float inherent to the latter.
From what I understood of the literature, the FLT_EPSILON macro in http://www.cplusplus.com/reference/cfloat/
defines my y value but this quick test proves it wrong:
float f1 = 1.234567f;
float f2 = 1.234567f + 1.192092896e-7f;
float f3 = 1.234567f + 1.192092895e-7f;
printf("Inicial:\t%f\n", f1);
printf("Inicial:\t%f\n", f2);
printf("Inicial:\t%f\n\n", f3);
Output:
Inicial: 1.234567
Inicial: 1.234567
Inicial: 1.234567
When the expected output should be:
Inicial: 1.234567
Inicial: 1.234568 <---
Inicial: 1.234567
What is that I am wrong about?
Should not the float value of x + FLT_EPSILON and x - FLT_EPSILON be the same?
EDIT: My question is being R the float value of x, what is the y value that x + y || x - y equals the same R float value?

Propagation of uncertainty is from the field of statistics and refers to how uncertainties in inputs affect mathematical functions of them. The analysis of errors that occur in computational arithmetic is numerical analysis.
FLT_EPSILON is not a measure of uncertainty or error in floating-point results. It is the distance between 1 and the next value representable in the float type. Hence, it is the size of steps between representable numbers at the magnitude of 1.
When you convert a decimal numeral to floating-point, the rounding error that results may have a magnitude of up to ½ the step size when the common round-to-nearest mode is used. The reason the bound is ½ the step size is that for any number x (within the finite domain of the floating-point format), there is a representable value within ½ the step size (inclusive). This is because, if there is a representable number more than ½ the step size in one direction, there is a representable number less than ½ the step size in the other direction.
The step size varies with the magnitudes of the numbers. With binary floating-point, it doubles at 2, and again at 4, then 8, and so on. Below 1, it halves, and again at ½, ¼, and so on.
When you perform floating-point arithmetic operations, the rounding that occurs in the computation may compound or cancel previous errors. There is no general formula for the final error.
The two numerals use used in your sample code, 1.192092897e-7f and 1.192092896e-7f, are so close together that they convert to the same float value, 2−23. That is why there is no difference in your f2 and f3.
There is a difference between f1 and f2, but you did not print enough digits to display it.
You ask “Should not the float value of x + FLT_EPSILON and x - FLT_EPSILON be the same?”, but your code does not contain x - FLT_EPSILON.
Re: “My question is being R the float value of x, what is the y value that x + y || x - y equals the same R float value?” This is trivially satisfied by y = 0. Did you mean to ask what is the largest value of y that satisfies the condition? That is a bit complicated.
The step size for a number x is called the ULP of x, which we may consider as a function ULP(x). ULP stands for Unit of Least Precision. It is the place value of the least digit in the floating-point representation of x. It is not a constant; it is a function of x.
For most values representable in a floating-point format, the largest y that satisfies your condition is ½ ULP(x) of the least digit in the floating-point representation of x is even and, if the digit is odd, it is just under ½ ULP(x). This complication arises from the rule that the results of arithmetic are rounded to the nearest representable value and, in case of a tie, the value with the even low digit is chosen. Thus, adding ½ ULP(x) to x will yield a tie that will round to x if the low digit is even, but will not round to x if the low digit is odd.
However, for x that are on the boundary where the ULP changes, the largest y that satisfies your condition is ¼ ULP(x). This is because, just below x (in magnitude), the step size changes, and the next number lower than x is half of x’s step size away instead of the usual full step size. So you can only go halfway toward that value before changing the result of the subtraction, so the most y can be is ¼ ULP(x).

Float is a 32 bit IEEE 754 single precision Floating Point Number: 1 bit for the sign, 8 bits for the exponent, and 23* for the value, i.e. float has 7 decimal digits of precision.
Increase the printf number of printed digits to see more but after 7 digits its just noise:
#include <stdio.h>
int main(void) {
float f1 = 1.234567f;
float f2 = 1.234567f + 1.192092897e-7f;
float f3 = 1.234567f + 1.192092896e-7f;
printf("Inicial:\t%.16f\n", f1);
printf("Inicial:\t%.16f\n", f2);
printf("Inicial:\t%.16f\n\n", f3);
return 0;
}
Output:
Inicial: 1.2345670461654663
Inicial: 1.2345671653747559
Inicial: 1.2345671653747559

float f1 = 1.234567f;
float f2 = f1 + 1.192092897e-7f;
float f3 = f1 + 1.192092896e-7f;
printf("Inicial:\t%.20f\n", f1);
printf("Inicial:\t%.20f\n", f2);
printf("Inicial:\t%.20f\n\n", f3);
Output:
Inicial: 1.23456704616546630000
Inicial: 1.23456716537475590000
Inicial: 1.23456716537475590000
No, your expectation is wrong
In the first printf call, you're printing the variable f1 with no effect which is just 1.234567f.

Computing RD(sqrt(x)) with a FPU in RU mode

Intervals of floating-point bounds can be used to over-approximate sets of reals, as long as the upper bound of any result interval is computed in round-upwards and the lower bound in round-downwards.
One recommended trick is to actually compute the negation of the lower bound. This allows to keep the FPU in round-upwards at all times (for instance, “Handbook of Floating-Point Arithmetic”, 2.9.2).
This works well for addition and for multiplication. The square root operation, on the other hand, is not symmetrical in the ways addition and multiplication are.
It occurs to me that in order to compute sqrtRD, for the lower bound, the following idiom, despite its complications, might be faster on an ordinary platform with IEEE 754 double-precision and FLT_EVAL_METHOD defined to 0 than changing the rounding mode twice:
#include <fenv.h>
#include <math.h>
#pragma STDC FENV_ACCESS ON
…
/* assumes round-upwards */
double sqrt_rd(double l) {
feclearexcept(FE_INEXACT);
double candidate = sqrt(l);
if (fetestexcept(FE_INEXACT))
return nextafter(candidate, 0);
return candidate;
}
I am wondering whether this is better, and whether if it is the fastest yet. As one possible alternative, but still not necessarily the fastest, it seems to me that FMARU(candidate, candidate, -l) might perhaps not be always exact (because of the directed rounding) but might be accurate enough around 0 for the following to work:
/* assumes round-upwards */
double sqrt_rd(double l) {
double candidate = sqrt(l);
if (fma(candidate, candidate, -l) != 0.0)
return nextafter(candidate, 0);
return candidate;
}
What other inexpensive ways are there to detect that sqrt was inexact?
What combination of floating-point operations leads to the fastest computation of sqrt_rd on a modern FPU set to round upwards?

I think you should be able to use:
/* assumes round-upwards */
double sqrt_rd(double l) {
double u = sqrt(l);
double w = u*u;
if (w != l)
return nextafter(u, 0);
return u;
}
The justification here being that if u is inexact, then it will be strictly greater than √l, which in turn implies that w >= u2 > l (since w is also calculated in RU mode). And if u is exact, then so is w (since we know it must be representable as a double).

fma calculates the result with infinite precision, then applies the rounding mode.
If your candidate is too large, then the infinitely precise result is greater than 0, and since you are rounding up, it will be rounded up. Even if it is only a tiny little bit larger than zero. To verify this, first try l = 1 + 2eps, where (1 + eps) = sqrt (1 + 2eps + eps^2) is just a tiny bit too large; then scale l down by a negative power of 4 so that the eps^2 is way beyond the resolution of denormalised numbers, and check that as well.

accuracy of sqrt of integers

I have a loop like this:
for(uint64_t i=0; i*i<n; i++) {
This requires doing a multiplication every iteration. If I could calculate the sqrt before the loop then I could avoid this.
unsigned cut = sqrt(n)
for(uint64_t i=0; i<cut; i++) {
In my case it's okay if the sqrt function rounds up to the next integer but it's not okay if it rounds down.
My question is: is the sqrt function accurate enough to do this for all cases?
Edit: Let me list some cases. If n is a perfect square so that n = y^2 my question would be - is cut=sqrt(n)>=y for all n? If cut=y-1 then there is a problem. E.g. if n = 120 and cut = 10 it's okay but if n=121 (11^2) and cut is still 10 then it won't work.
My first concern was the fractional part of float only has 23 bits and double 52 so they can't store all the digits of some 32-bit or 64-bit integers. However, I don't think this is a problem. Let's assume we want the sqrt of some number y but we can't store all the digits of y. If we let the fraction of y we can store be x we can write y = x + dx then we want to make sure that whatever dx we choose does not move us to the next integer.
sqrt(x+dx) < sqrt(x) + 1 //solve
dx < 2*sqrt(x) + 1
// e.g for x = 100 dx < 21
// sqrt(100+20) < sqrt(100) + 1
Float can store 23 bits so we let y = 2^23 + 2^9. This is more than sufficient since 2^9 < 2*sqrt(2^23) + 1. It's easy to show this for double as well with 64-bit integers. So although they can't store all the digits as long as the sqrt of what they can store is accurate then the sqrt(fraction) should be sufficient. Now let's look at what happens for integers close to INT_MAX and the sqrt:
unsigned xi = -1-1;
printf("%u %u\n", xi, (unsigned)(float)xi); //4294967294 4294967295
printf("%u %u\n", (unsigned)sqrt(xi), (unsigned)sqrtf(xi)); //65535 65536
Since float can't store all the digits of 2^31-2 and double can they get different results for the sqrt. But the float version of the sqrt is one integer larger. This is what I want. For 64-bit integers as long as the sqrt of the double always rounds up it's okay.

First, integer multiplication is really quite cheap. So long as you have more than a few cycles of work per loop iteration and one spare execute slot, it should be entirely hidden by reorder on most non-tiny processors.
If you did have a processor with dramatically slow integer multiply, a truly clever compiler might transform your loop to:
for (uint64_t i = 0, j = 0; j < cut; j += 2*i+1, i++)
replacing the multiply with an lea or a shift and two adds.
Those notes aside, let’s look at your question as stated. No, you can’t just use i < sqrt(n). Counter-example: n = 0x20000000000000. Assuming adherence to IEEE-754, you will have cut = 0x5a82799, and cut*cut is 0x1ffffff8eff971.
However, a basic floating-point error analysis shows that the error in computing sqrt(n) (before conversion to integer) is bounded by 3/4 of an ULP. So you can safely use:
uint32_t cut = sqrt(n) + 1;
and you’ll perform at most one extra loop iteration, which is probably acceptable. If you want to be totally precise, instead use:
uint32_t cut = sqrt(n);
cut += (uint64_t)cut*cut < n;
Edit: z boson clarifies that for his purposes, this only matters when n is an exact square (otherwise, getting a value of cut that is “too small by one” is acceptable). In that case, there is no need for the adjustment and on can safely just use:
uint32_t cut = sqrt(n);
Why is this true? It’s pretty simple to see, actually. Converting n to double introduces a perturbation:
double_n = n*(1 + e)
which satisfies |e| < 2^-53. The mathematical square root of this value can be expanded as follows:
square_root(double_n) = square_root(n)*square_root(1+e)
Now, since n is assumed to be a perfect square with at most 64 bits, square_root(n) is an exact integer with at most 32 bits, and is the mathematically precise value that we hope to compute. To analyze the square_root(1+e) term, use a taylor series about 1:
square_root(1+e) = 1 + e/2 + O(e^2)
= 1 + d with |d| <~ 2^-54
Thus, the mathematically exact value square_root(double_n) is less than half an ULP away from[1] the desired exact answer, and necessarily rounds to that value.
[1] I’m being fast and loose here in my abuse of relative error estimates, where the relative size of an ULP actually varies across a binade — I’m trying to give a bit of the flavor of the proof without getting too bogged down in details. This can all be made perfectly rigorous, it just gets to be a bit wordy for Stack Overflow.

All my answer is useless if you have access to IEEE 754 double precision floating point, since Stephen Canon demonstrated both
a simple way to avoid imul in loop
a simple way to compute the ceiling sqrt
Otherwise, if for some reason you have a non IEEE 754 compliant platform, or only single precision, you could get the integer part of square root with a simple Newton-Raphson loop. For example in Squeak Smalltalk we have this method in Integer:
sqrtFloor
"Return the integer part of the square root of self"
| guess delta |
guess := 1 bitShift: (self highBit + 1) // 2.
[
delta := (guess squared - self) // (guess + guess).
delta = 0 ] whileFalse: [
guess := guess - delta ].
^guess - 1
Where // is operator for quotient of integer division.
Final guard guess*guess <= self ifTrue: [^guess]. can be avoided if initial guess is fed in excess of exact solution as is the case here.
Initializing with approximate float sqrt was not an option because integers are arbitrarily large and might overflow
But here, you could seed the initial guess with floating point sqrt approximation, and my bet is that the exact solution will be found in very few loops. In C that would be:
uint32_t sqrtFloor(uint64_t n)
{
int64_t diff;
int64_t delta;
uint64_t guess=sqrt(n); /* implicit conversions here... */
while( (delta = (diff=guess*guess-n) / (guess+guess)) != 0 )
guess -= delta;
return guess-(diff>0);
}
That's a few integer multiplications and divisions, but outside the main loop.

What you are looking for is a way to calculate a rational upper bound of the square root of a natural number. Continued fraction is what you need see wikipedia.
For x>0, there is
.
To make the notation more compact, rewriting the above formula as
Truncate the continued fraction by removing the tail term (x-1)/2's at each recursion depth, one gets a sequence of approximations of sqrt(x) as below:
Upper bounds appear at lines with odd line numbers, and gets tighter. When distance between an upper bound and its neighboring lower bound is less than 1, that approximation is what you need. Using that value as the value of cut, here cut must be a float number, solves the problem.
For very large number, rational number should be used, so no precision is lost during conversion between integer and floating point number.

I need a fast 96-bit on 64-bit specific division algorithm for a fixed-point math library

I am currently writing a fast 32.32 fixed-point math library. I succeeded at making adding, subtraction and multiplication work correctly, but I am quite stuck at division.
A little reminder for those who can't remember: a 32.32 fixed-point number is a number having 32 bits of integer part and 32 bits of fractional part.
The best algorithm I came up with needs 96-bit integer division, which is something compilers usually don't have built-ins for.
Anyway, here it goes:
G = 2^32
notation: x is the 64-bit fixed-point number, x1 is its low nibble and x2 is its high
G*(a/b) = ((a1 + a2*G) / (b1 + b2*G))*G // Decompose this
G*(a/b) = (a1*G) / (b1*G + b2) + (a2*G*G) / (b1*G + b2)
As you can see, the (a2*G*G) is guaranteed to be larger than the regular 64-bit integer. If uint128_t's were actually supported by my compiler, I would simply do the following:
((uint128_t)x << 32) / y)
Well they aren't and I need a solution. Thank you for your help.

You can decompose a larger division into multiple chunks that do division with less bits. As another poster already mentioned the algorithm can be found in TAOCP from Knuth.
However, no need to buy the book!
There is a code on the hackers delight website that implements the algorithm in C. It's written to do 64-bit unsigned divisions using 32-bit arithmetic only, so you can't directly cut'n'paste the code. To get from 64 to 128-bit you have to widen all types, masks and constans by two e.g. a short becomes a int, a 0xffff becomes 0xffffffffll ect.
After this easy easy change you should be able to do 128bit divisions.
The code is mirrored on GitHub, but was originally posted on Hackersdelight.org (original link no longer accessible).
Since your largest values only need 96-bit, One of the 64-bit divisions will always return zero, so you can even simplify the code a bit.
Oh - and before I forget this: The code only works with unsigned values. To convert from signed to unsigned divide you can do something like this (pseudo-code style):
fixpoint Divide (fixpoint a, fixpoint b)
{
// check if the integers are of different sign:
fixpoint sign_difference = a ^ b;
// do unsigned division:
fixpoint x = unsigned_divide (abs(a), abs(b));
// if the signs have been different: negate the result.
if (sign_difference < 0)
{
x = -x;
}
return x;
}
The website itself is worth checking out as well: http://www.hackersdelight.org/
By the way - nice task that you're working on.. Do you mind telling us for what you need the fixed-point library?
By the way - the ordinary shift and subtract algorithm for division would work as well.
If you target x86 you can implement it using MMX or SSE intrinsics. The algorithm relies only on primitive operations, so it could perform quite fast as well.

Better self-adjusting answer:
Forgive the C#-ism of the answer, but the following should work in all cases. There is likely a solution possible that finds the right shifts to use quicker, but I'd have to think much deeper than I can right now. This should be reasonably efficient though:
int upshift = 32;
ulong mask = 0xFFFFFFFF00000000;
ulong mod = x % y;
while ((mod & mask) != 0)
{
// Current upshift of the remainder would overflow... so adjust
y >>= 1;
mask <<= 1;
upshift--;
mod = x % y;
}
ulong div = ((x / y) << upshift) + (mod << upshift) / y;
Simple but unsafe answer:
This calculation can cause an overflow in the upshift of the x % y remainder if this remainder has any bits set in the high 32 bits, causing an incorrect answer.
((x / y) << 32) + ((x % y) << 32) / y
The first part uses integer division and gives you the high bits of the answer (shift them back up).
The second part calculates the low bits from the remainder of the high-bit division (the bit that could not be divided any further), shifted up and then divided.

I like Nils' answer, which is probably the best. It's just long division, like we all learned in grade school, except the digits are base 2^32 instead of base 10.
However, you might also consider using Newton's approximation method for division:
x := x (N + N - N * D * x)
where N is the numerator and D is the demoninator.
This just uses multiplies and adds, which you already have, and it converges very quickly to about 1 ULP of precision. On the other hand, you won't be able to acheive the exact 0.5-ULP answer in all cases.
In any case, the tricky bit is detecting and handling the overflows.

Quick -n- dirty.
Do the A/B divide with double precision floating point.
This gives you C~=A/B. It's only approximate because of floating point precision and 53 bits of mantissa.
Round off C to a representable number in your fixed point system.
Now compute (again with your fixed point) D=A-C*B. This should have significantly lower magnitude than A.
Repeat , now computing D/B with floating point. Again, round the answer to an integer. Add each division result together as you go. You can stop when your remainder is so small that your floating point divide returns 0 after rounding.
You're still not done. Now you're very close to the answer, but the divisions weren't exact.
To finalize, you'll have to do a binary search. Using the (very good) starting estimate, see if increasing it improves the error.. you basically want to bracket the proper answer and keep dividing the range in half with new tests.
Yes, you could do Newton iteration here, but binary search will likely be easier since you need only simple multiplies and adds using your existing 32.32 precision toolkit.
This is not the most efficient method, but it's by far the easiest to code.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight