Most accurate way to compute asinhf() from log1pf()? - c

The inverse hyperbolic function asinh() is closely related to the natural logarithm. I am trying to determine the most accurate way to compute asinh() from the C99 standard math function log1p(). For ease of experimentation, I am limiting myself to IEEE-754 single-precision computation right now, that is I am looking at asinhf() and log1pf(). I intend to re-use the exact same algorithm for double precision computation, i.e. asinh() and log1p(), later.
My primary goal is to minimize ulp error, the secondary goal is to minimize the number of incorrectly rounded results, under the constraint that the improved code would at most be minimally slower than the versions posted below. Any incremental improvement to accuracy, say 0.2 ulp, would be welcome. Adding a couple of FMAs (fused multiply-adds) would be fine, on the other hand I am hoping someone could identify a solution which employs a fast rsqrtf() (reciprocal square root).
The resulting C99 code should lend itself to vectorization, possibly by some minor straightforward transformations. All intermediate computation must occur at the precision of the function argument and result, as any switch to higher precision may have a severe negative performance impact. The code must work correctly both with IEEE-754 denormal support and in FTZ (flush to zero) mode.
So far, I have identified the following two candidate implementations. Note that the code may be easily transformed into a branchless vectorizable version with a single call to log1pf(), but I have not done so at this stage to avoid unnecessary obfuscation.
/* for a >= 0, asinh(a) = log (a + sqrt (a*a+1))
= log1p (a + (sqrt (a*a+1) - 1))
= log1p (a + sqrt1pm1 (a*a))
= log1p (a + (a*a / (1 + sqrt(a*a + 1))))
= log1p (a + a * (a / (1 + sqrt(a*a + 1))))
= log1p (fma (a / (1 + sqrt(a*a + 1)), a, a)
= log1p (fma (1 / (1/a + sqrt(1/a*a + 1)), a, a)
*/
float my_asinhf (float a)
{
float fa, t;
fa = fabsf (a);
#if !USE_RECIPROCAL
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
t = log1pf (t);
}
#else // USE_RECIPROCAL
if (fa > 0x1.0p126f) { // prevent underflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = 1.0f / fa;
t = fmaf (1.0f / (t + sqrtf (fmaf (t, t, 1.0f))), fa, fa);
t = log1pf (t);
}
#endif // USE_RECIPROCAL
return copysignf (t, a); // restore sign
}
With a particular log1pf() implementation that is accurate to < 0.6 ulps, I am observing the following error statistics when testing exhaustively across all 232 possible IEEE-754 single-precision inputs. When USE_RECIPROCAL = 0, the maximum error is 1.49486 ulp, and there are 353,587,822 incorrectly rounded results. With USE_RECIPROCAL = 1, the maximum error is 1.50805 ulp, and there are only 77,569,390 incorrectly rounded results.
In terms of performance, the variant USE_RECIPROCAL = 0 will be faster if reciprocals and full divisions take roughly the same amount of time, but the variant USE_RECIPROCAL = 1 could be faster if very fast reciprocal support is available.
Answers can assume that all basic arithmetic, including FMA (fused multiply-add) is correctly rounded according to IEEE-754 round-to-nearest-or-even mode. In addition, faster, nearly correctly rounded, versions of reciprocal and rsqrtf() may be available, where "nearly correctly rounded" means the maximum ulp error will be limited to something like 0.53 ulps and the overwhelming majority of results, say > 95%, are correctly rounded. Basic arithmetic with directed roundings may be available at no additional cost to performance.

Firstly, you may want to look into the accuracy and speed of your log1pf function: these can vary a bit between libms (I've found the OS X math functions to be fast, the glibc ones to be slower but typically correctly rounded).
Openlibm, based on the BSD libm, which in turn is based on Sun's fdlibm, use multiple approaches by range, but the main bit is the relation:
t = x*x;
w = log1pf(fabsf(x)+t/(one+sqrtf(one+t)));
You may also want to try compiling with the -fno-math-errno option, which disables the old System V error codes for sqrt (IEEE-754 exceptions will still work).

After various additional experiments, I have convinced myself that a simple argument transformation that does not use higher precision than the argument and result cannot achieve a tighter error bound than the one achieved by the first variant in the code I posted.
Since my question is about minimizing the error for the argument transformation which is incurred in addition to the error in log1pf() itself, the most straightforward approach to use for experimentation is to utilize a correctly rounded implementation of that logarithm function. Note that a correctly-rounded implementation is highly unlikely to exist in the context of a high-performance environment. According to the works of J.-M. Muller et. al., to produce accurate single-precision results, x86 extended precision computation should be sufficient, for example
float accurate_log1pf (float a)
{
float res;
__asm fldln2;
__asm fld dword ptr [a];
__asm fyl2xp1;
__asm fst dword ptr [res];
__asm fcompp;
return res;
}
An implementation of asinhf() using the first variant from my question then looks as follows:
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
t = accurate_log1pf (t);
}
return copysignf (t, a); // restore sign
}
Testing with all 232 IEEE-754 single-precision operands shows that the maximum error of 1.49486070 ulp occurs at ±0x1.ff5022p-9 and there are 353,521,140 incorrectly rounded results. What happens if the entire argument transformation uses double-precision arithmetic? The code changes to
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
double tt = fa;
tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
t = (float)tt;
t = accurate_log1pf (t);
}
return copysignf (t, a); // restore sign
}
However, the error bound does not improve with this change! The maximum error of 1.49486070 ulp still occurs at ±0x1.ff5022p-9 and there are now 350,971,046 incorrectly rounded results, slightly fewer than before. The issue seems to be that a float operand cannot convey enough information to log1pf() to produce more accurate results. A similar problem occurs when computing sinf() and cosf(). If the reduced argument, represented as a correctly rounded float operand, is passed to the core polynomials, the resulting error in sinf() and cosf() is just a tad under 1.5 ulp, just as we are observing here with my_asinhf().
One solution is to compute the transformed argument to higher than single precision, for example as a double-float operand pair (a useful brief overwiew of double-float techniques can be found in this paper by Andrew Thall). In this case, we can use the additional information to perform linear interpolation on the result, based on the knowledge that the derivative of the logarithm is the reciprocal. This gives us:
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
double tt = fa;
tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
t = (float)tt; // "head" of double-float
s = (float)(tt - (double)t); // "tail" of double-float
t = fmaf (s, 1.0f / (1.0f + t), accurate_log1pf (t)); // interpolate
}
return copysignf (t, a); // restore sign
}
Exhaustive test of this version indicates that the maximum error has been reduced to 0.99999948 ulp, it occurs at ±0x1.deeea0p-22. There are 349,653,534 incorrectly rounded results. A faithfully-rounded implementation of asinhf() has been achieved.
Unfortunately, the practical utility of this result is limited. Depending on HW platform, the throughput of arithmetic operations on double may only be 1/2 to 1/32 of the throughput of float operations. The double-precision computation can be replaced with double-float computation, but this would incur very significant cost as well. Lastly, my approach here was to use the single-precision implementation as a proving ground for subsequent double-precision work, and many hardware platforms (certainly all the ones I am interested in) do not offer hardware support for a numeric format with higher precision than IEEE-754 binary64 (double precision). Therefore any solution should not require higher-precision arithmetic in intermediate computation.
Since all the troublesome arguments in the case of asinhf() are small in magnitude, one could [partially?] address the accuracy issue by using a polynomial minimax approximation for the region around the origin. As this would create another code branch, it would likely make vectorization more difficult.

Related

Underflow error in floating point arithmetic in C

I am new to C, and my task is to create a function
f(x) = sqrt[(x^2)+1]-1
that can handle very large numbers and very small numbers. I am submitting my script on an online interface that checks my answers.
For very large numbers I simplify the expression to:
f(x) = x-1
By just using the highest power. This was the correct answer.
The same logic does not work for smaller numbers. For small numbers (on the order of 1e-7), they are very quickly truncated to zero, even before they are squared. I suspect that this has to do with floating point precision in C. In my textbook, it says that the float type has smallest possible value of 1.17549e-38, with 6 digit precision. So although 1e-7 is much larger than 1.17e-38, it has a higher precision, and is therefore rounded to zero. This is my guess, correct me if I'm wrong.
As a solution, I am thinking that I should convert x to a long double when x < 1e-6. However when I do this, I still get the same error. Any ideas? Let me know if I can clarify. Code below:
#include <math.h>
#include <stdio.h>
double feval(double x) {
/* Insert your code here */
if (x > 1e299)
{;
return x-1;
}
if (x < 1e-6)
{
long double g;
g = x;
printf("x = %Lf\n", g);
long double a;
a = pow(x,2);
printf("x squared = %Lf\n", a);
return sqrt(g*g+1.)- 1.;
}
else
{
printf("x = %f\n", x);
printf("Used third \n");
return sqrt(pow(x,2)+1.)-1;
}
}
int main(void)
{
double x;
printf("Input: ");
scanf("%lf", &x);
double b;
b = feval(x);
printf("%f\n", b);
return 0;
}
For small inputs, you're getting truncation error when you do 1+x^2. If x=1e-7f, x*x will happily fit into a 32 bit floating point number (with a little bit of error due to the fact that 1e-7 does not have an exact floating point representation, but x*x will be so much smaller than 1 that floating point precision will not be sufficient to represent 1+x*x.
It would be more appropriate to do a Taylor expansion of sqrt(1+x^2), which to lowest order would be
sqrt(1+x^2) = 1 + 0.5*x^2 + O(x^4)
Then, you could write your result as
sqrt(1+x^2)-1 = 0.5*x^2 + O(x^4),
avoiding the scenario where you add a very small number to 1.
As a side note, you should not use pow for integer powers. For x^2, you should just do x*x. Arbitrary integer powers are a little trickier to do efficiently; the GNU scientific library for example has a function for efficiently computing arbitrary integer powers.
There are two issues here when implementing this in the naive way: Overflow or underflow in intermediate computation when computing x * x, and substractive cancellation during final subtraction of 1. The second issue is an accuracy issue.
ISO C has a standard math function hypot (x, y) that performs the computation sqrt (x * x + y * y) accurately while avoiding underflow and overflow in intermediate computation. A common approach to fix issues with subtractive cancellation is to transform the computation algebraically such that it is transformed into multiplications and / or divisions.
Combining these two fixes leads to the following implementation for float argument. It has an error of less than 3 ulps across all possible inputs according to my testing.
/* Compute sqrt(x*x+1)-1 accurately and without spurious overflow or underflow */
float func (float x)
{
return (x / (1.0f + hypotf (x, 1.0f))) * x;
}
A trick that is often useful in these cases is based on the identity
(a+1)*(a-1) = a*a-1
In this case
sqrt(x*x+1)-1 = (sqrt(x*x+1)-1)*(sqrt(x*x+1)+1)
/(sqrt(x*x+1)+1)
= (x*x+1-1) / (sqrt(x*x+1)+1)
= x*x/(sqrt(x*x+1)+1)
The last formula can be used as an implementation. For vwry small x sqrt(x*x+1)+1 will be close to 2 (for small enough x it will be 2) but we don;t loose precision in evaluating it.
The problem isn't with running into the minimum value, but with the precision.
As you said yourself, float on your machine has about 7 digits of precision. So let's take x = 1e-7, so that x^2 = 1e-14. That's still well within the range of float, no problems there. But now add 1. The exact answer would be 1.00000000000001. But if we only have 7 digits of precision, this gets rounded to 1.0000000, i.e. exactly 1. So you end up computing sqrt(1.0)-1 which is exactly 0.
One approach would be to use the linear approximation of sqrt around x=1 that sqrt(x) ~ 1+0.5*(x-1). That would lead to the approximation f(x) ~ 0.5*x^2.

Accurate computation of PDF of standard normal distribution using C standard math library

The probability density function of the standard normal distribution is defined as e-x2/2 / √(2π). This can be rendered in straightforward manner into C code. A sample single-precision implementation might be:
float my_normpdff (float a)
{
return 0x1.988454p-2f * my_expf (-0.5f * a * a); /* 1/sqrt(2*pi) */
}
While this code is free from premature underflow, there is an accuracy issue since the error incurred in the computation of a2/2 is magnified by the subsequent exponentiation. One can easily demonstrate this with tests against higher-precision references. The exact error will differ based on the accuracy of the exp() or expf() implementations used; for faithfully rounded exponentiation functions one would typically observe a maximum error of around 26 ulps for IEEE-754 binary32 single precision, around 29 ulps for IEEE-754 binary64 double precision.
How can the accuracy issue be addressed in a reasonably efficient manner? A trivial approach would be to employ higher-precision intermediate computation, for example use double computation for the float implementation. But this approach does not work for a double implementation if floating-point arithmetic of higher precision is not easily available, and may be inefficient for float implementation if double arithmetic is significantly more expensive than float computation, e.g. on many GPUs.
The accuracy issue raised in the question can effectively, and efficiently, be addressed by the use of limited amounts of double-float or double-double computation, facilitated by the use of the fused multiply-add (FMA) operation.
This operation is available since C99 via the standard math functions fmaf(a,b,c) and fma(a,b,c) which compute a*b+c, without rounding of the intermediate product. While the functions map directly to fast hardware operations on almost all modern processors, they may use emulation code on older platforms, in which case they may be be very slow.
This allows the computation of the product with twice the normal precision using just two operations, resulting in a head:tail pair of native-precision numbers:
prod_hi = a * b // head
prod_lo = FMA (a, b, -hi) // tail
The high-order bits of the result can be passed to the exponentiation, while the low-order bits are used for improving the accuracy of the result via linear interpolation, taking advantage of the fact that ex is its own derivative:
e = exp (prod_hi) + exp (prod_hi) * prod_lo // exp (a*b)
This allows us to eliminate most of the error of the naive implementation. The other, minor, source of computation error is the limited precision with which the constant 1/√(2π) is represented. This can be improved by using a head:tail representation for the constant that provides twice the native precision, and computing:
r = FMA (const_hi, x, const_lo * x) // const * x
The following paper points out that this technique can even result in correctly-rounded multiplication for some arbitrary-precision constants:
Nicolas Brisebarre and Jean-Michel Muller, "Correctly rounded multiplication by arbitrary precision constants", IEEE Transactions on Computers, Vol. 57, No. 2, February 2008, pp. 165-174
Combining the two techniques, and taking care of a few corner cases involving NaNs, we arrive at the following float implementation based on IEEE-754 binary32:
float my_normpdff (float a)
{
const float RCP_SQRT_2PI_HI = 0x1.988454p-02f; /* 1/sqrt(2*pi), msbs */
const float RCP_SQRT_2PI_LO = -0x1.857936p-27f; /* 1/sqrt(2*pi), lsbs */
float ah, sh, sl, ea;
ah = -0.5f * a;
sh = a * ah;
sl = fmaf (a, ah, 0.5f * a * a); /* don't flip "sign bit" of NaN argument */
ea = expf (sh);
if (ea != 0.0f) ea = fmaf (sl, ea, ea); /* avoid creation of NaN */
return fmaf (RCP_SQRT_2PI_HI, ea, RCP_SQRT_2PI_LO * ea);
}
The corresponding double implementation, based on IEEE-754 binary64, looks almost identical, except for the different constant values used:
double my_normpdf (double a)
{
const double RCP_SQRT_2PI_HI = 0x1.9884533d436510p-02; /* 1/sqrt(2*pi), msbs */
const double RCP_SQRT_2PI_LO = -0x1.cbc0d30ebfd150p-56; /* 1/sqrt(2*pi), lsbs */
double ah, sh, sl, ea;
ah = -0.5 * a;
sh = a * ah;
sl = fma (a, ah, 0.5 * a * a); /* don't flip "sign bit" of NaN argument */
ea = exp (sh);
if (ea != 0.0) ea = fma (sl, ea, ea); /* avoid creation of NaN */
return fma (RCP_SQRT_2PI_HI, ea, RCP_SQRT_2PI_LO * ea);
}
The accuracy of these implementations depends on the accuracy of the standard math functions expf() and exp(), respectively. Where the C math library provides faithfully-rounded versions of those, the maximum error of either of the two implementations above is typically less than 2.5 ulps.

Efficient computation of 2**64 / divisor via fast floating-point reciprocal

I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. It requires computation of 264 / divisor, as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, based on the requirements of the following fixed-point iterations. This means this computation needs to provide an underestimate. I currently have the following code, which works well, based on extensive testing:
#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()
uint64_t divisor, recip;
float r, s, t;
t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor
While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:
#include <xmmintrin.h>
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
__m128 t;
float e, r;
t = _mm_set_ss (a);
t = _mm_rcp_ss (t);
_mm_store_ss (&r, t);
e = fmaf (r, -a, 1.0f);
e = fmaf (e, e, e);
r = fmaf (e, r, r);
return r;
}
Implementations of nextafterf() are typically not performance optimized. On platforms where there are means to quickly re-interprete an IEEE 754 binary32 into an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine use of nextafterf() and scaling as follows:
s = int_as_float (float_as_int (r) + 0x1fffffff);
Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types, making this a performance bottleneck.
t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
A portable, but slow, implementation of uint64_to_float_ru uses dynamic changes to FPU rounding mode:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
float uint64_to_float_ru (uint64_t a)
{
float res;
int curr_mode = fegetround ();
fesetround (FE_UPWARD);
res = (float)a;
fesetround (curr_mode);
return res;
}
I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes this computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and follow those up with a 32-bit fixed-point Newton-Raphson step.
Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-) the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.
uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
float r, s, t;
/* compute initial approximation for reciprocal; must be underestimate! */
t = uint64_to_float_ru (divisor);
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
/* perform Halley iteration with cubic convergence to refine reciprocal */
temp = neg_divisor * recip;
temp = umul64hi (temp, temp) + temp;
recip = umul64hi (recip, temp) + recip;
/* compute preliminary quotient and remainder */
quot = umul64hi (dividend, recip);
rem = dividend - divisor * quot;
/* adjust quotient if too small; quotient off by 2 at most */
if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;
/* handle division by zero */
if (divisor == 0ULL) quot = ~0ULL;
return quot;
}
umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:
inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
uint64_t res;
__asm__ (
"movq %1, %%rax;\n\t" // rax = a
"mulq %2;\n\t" // rdx:rax = a * b
"movq %%rdx, %0;\n\t" // res = (a * b)<63:32>
: "=rm" (res)
: "rm"(a), "rm"(b)
: "%rax", "%rdx");
return res;
}
This solution combines two ideas:
You can convert to floating point by simply reinterpreting the bits as floating point and subtracting a constant, so long as the number is within a particular range. So add a constant, reinterpret, and then subtract that constant. This will give a truncated result (which is therefore always less than or equal the desired value).
You can approximate reciprocal by negating both the exponent and the mantissa. This may be achieved by interpreting the bits as int.
Option 1 here only works in a certain range, so we check the range and adjust the constants used. This works in 64 bits because the desired float only has 23 bits of precision.
The result in this code will be double, but converting to float is trivial, and can be done on the bits or directly, depending on hardware.
After this you'd want to do the Newton-Raphson iteration(s).
Much of this code simply converts to magic numbers.
double
u64tod_inv( uint64_t u64 ) {
__asm__( "#annot0" );
union {
double f;
struct {
unsigned long m:52; // careful here with endianess
unsigned long x:11;
unsigned long s:1;
} u64;
uint64_t u64i;
} z,
magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },
magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },
magic2 = { .u64 = { 0, 2046, 0 } };
__asm__( "#annot1" );
if( u64 < (1UL << 52UL ) ) {
z.u64i = u64 + magic0.u64i;
z.f -= magic0.f;
} else {
z.u64i = ( u64 >> 12 ) + magic1.u64i;
z.f -= magic1.f;
}
__asm__( "#annot2" );
z.u64i = magic2.u64i - z.u64i;
return z.f;
}
Compiling this on an Intel core 7 gives a number of instructions (and a branch), but, of course, no multiplies or divides at all. If the casts between int and double are fast this should run pretty quickly.
I suspect float (with only 23 bits of precision) will require more than 1 or 2 Newton-Raphson iterations to get the accuracy you want, but I haven't done the math...

Efficiently computing (a - K) / (a + K) with improved accuracy

In various contexts, for example for the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive variable argument and K is a constant. In many cases, K is a power of two, which is the use case relevant to my work. I am looking for efficient ways to compute this quotient more accurately than can be accomplished with the straightforward division. Hardware support for fused multiply-add (FMA) can be assumed, as this operation is provided by all major CPU and GPU architectures at this time, and is available in C/C++ via the functionsfma() and fmaf().
For ease of exploration, I am experimenting with float arithmetic. Since I plan to port the approach to double arithmetic as well, no operations using higher than the native precision of both argument and result may be used. My best solution so far is:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 1 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q, -2.0f*K, m);
e = fmaf (q, -m, t);
q = fmaf (r, e, q);
For arguments a in the interval [K/2, 4.23*K], code above computes the quotient almost correctly rounded for all inputs (maximum error is exceedingly close to 0.5 ulps), provided that K is a power of 2, and there is no overflow or underflow in intermediate results. For K not a power of two, this code is still more accurate than the naive algorithm based on division. In terms of performance, this code can be faster than the naive approach on platforms where the floating-point reciprocal can be computed faster than the floating-point division.
I make the following observation when K = 2n: When the upper bound of the work interval increases to 8*K, 16*K, ... maximum error increases gradually and starts to slowly approximate the maximum error of the naive computation from below. Unfortunately, the same does not appear to be true for the lower bound of the interval. If the lower bound drops to 0.25*K, the maximum error of the improved method above equals the maximum error of the naive method.
Is there a method to compute q = (a - K) / (a + K) that can achieve smaller maximum error (measured in ulp vs the mathematical result) compared to both the naive method and the above code sequence, over a wider interval, in particular for intervals whose lower bound is less than 0.5*K? Efficiency is important, but a few more operations than are used in the above code can likely be tolerated.
In one answer below, it was pointed out that I could enhance accuracy by returning the quotient as an unevaluated sum of two operands, that is, as a head-tail pair q:qlo, i.e. similar to the well-known double-float and double-double formats. In my code above, this would mean changing the last line to qlo = r * e.
This approach is certainly useful, and I had already contemplated its use for an extended-precision logarithm for use in pow(). But it doesn't fundamentally help with the desired widening of the interval on which the enhanced computation provides more accurate quotients. In a particular case I am looking at, I would like to use K=2 (for single precision) or K=4 (for double precision) to keep the primary approximation interval narrow, and the interval for a is roughly [0,28]. The practical problem I am facing is that for arguments < 0.25*K the accuracy of the improved division is not substantially better than with the naive method.
If a is large compared to K, then (a-K)/(a+K) = 1 - 2K / (a + K) will give a good approximation. If a is small compared to K, then 2a / (a + K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation, so doing the division will give a decent result.
One possibility is to track error of m and p into m1 and p1 with classical Dekker/Schewchuk:
m=a-k;
k0=a-m;
a0=k0+m;
k1=k0-k;
a1=a-a0;
m1=a1+k1;
p=a+k;
k0=p-a;
a0=p-k0;
k1=k-k0;
a1=a-a0;
p1=a1+k1;
Then, correct the naive division:
q=m/p;
r0=fmaf(p,-q,m);
r1=fmaf(p1,-q,m1);
r=r0+r1;
q1=r/p;
q=q+q1;
That'll cost you 2 divisions, but should be near half ulp if I didn't screw up.
But these divisions can be replaced by multiplications with inverse of p without any problem, since the first incorrectly rounded division will be compensated by remainder r, and second incorrectly rounded division does not really matter (the last bits of correction q1 won't change anything).
I don't really have an answer (proper floating point error analyses are very tedious) but a few observations:
Fast reciprocal instructions (such as RCPSS) are not as accurate as division, so you may see a reduction in accuracy if using these.
m is computed exactly if a &in; [0.5×Kb, 21+n×Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e. if K is a power of 2, then n=23).
This is similar to a simplified form of the div2 algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from this (i.e. store m as the sum of 2 floats, or use a double).
Since my goal is to merely widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a, making use of double-float arithmetic for all intermediate computation seems too costly.
Thinking some more about the problem, it is clear that the computation of the remainder of the division, e in the code from my question, is the crucial part of achieving more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m to represent (a-K) and represented (a+k) as m + 2*K, as this delivers numerically superior results to the straightforward representation.
With relatively small additional computational cost, (a+K) can be represented as a double-float, that is, a head-tail pair p:plo, which leads to the following modified version of my original code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mx = fmaxf (a, K);
mn = fminf (a, K);
plo = (mx - p) + mn;
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
q = fmaf (r, e, q);
Testing shows that this delivers nearly correctly rounded results for a in [K/2, 224*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.
Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-float head-tail pair m:mlo, which leads to the following code variant:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
plo = (a < K) ? ((K - p) + a) : ((a - p) + K);
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -p, m);
e = fmaf (q, -plo, t);
e = e + mlo;
q = fmaf (r, e, q);
Exhaustive testing hows that this delivers nearly correctly rounded results for a in the interval [K/224, K*224). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.
As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p, plo. This approach results in the following code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);
t = fmaf (q, -2.0f*K, m);
t = fmaf (q, -m, t);
e = fmaf (q - 1.0f, -mlo, t);
q = fmaf (r, e, q);
This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K=2n nearly correctly rounded results are produced for values of a in the interval [K/224, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.
Lastly, one might base the residual computation directly on the original variable a to avoid the error inherent in the computation of m and p. This leads to the following code that, for K = 2n, computes nearly correctly rounded results for a in the interval [K/224, K/3):
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;
t = fmaf (q + 1.0f, -K, a);
e = fmaf (q, -a, t);
q = fmaf (r, e, q);
If you can relax the API to return another variable that models the error, then the solution becomes much simpler:
float foo(float a, float k, float *res)
{
float ret=(a-k)/(a+k);
*res = fmaf(-ret,a+k,a-k)/(a+k);
return ret;
}
This solution only handles truncation error of division, but does not handle the loss of precision of a+k and a-k.
To handle those errors, I think I need to use double precision, or bithack to use fixed point.
Test code is updated to artificially generate non zero least significant bits
in the input
test code
https://ideone.com/bHxAg8
The problem is the addition in (a + K). Any loss of precision in (a + K) is magnified by the division. The problem isn't the division itself.
If the exponents of a and K are the same (almost) no precision is lost, and if the absolute difference between the exponents is greater than the significand size then either (a + K) == a (if a has larger magnitude) or (a + K) == K (if K has larger magnitude).
There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only helps widen the "accurate result range" slightly. To understand why, consider smallest + largest (where smallest is the smallest positive denormal a 32-bit floating point number can be). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.) temp = 1/(a + K); result = a * temp - K / temp; won't help much either because you've still got exactly the same (a + K) problem (but it would avoid a similar problem in (a - K)). Also you can't do result = anything / p + anything_error/p_error because division doesn't work like that.
There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of a that can fit in 32-bit floating point. None are likely to be acceptable.
The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of a, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course if the range of possible values of a is smaller than "any positive value that can fit in a 32-bit float" the size of the lookup table would be reduced.
The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
The third alternative involves, "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate temp1 = (a + K); if(a < K) temp2 = (a - K); then switch to "round to negative infinity" and calculate if(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;. Next do a_lower = a and decrease a_lower by the smallest amount possible and repeat the "lower_bound" calculation, and keep doing that until you get a different value for lower_bound, then revert back to the previous value of a_lower. After that you do essentially the same (but opposite rounding modes, and incrementing not decrementing) to determine upper_bound and a_upper (starting with the original value of a). Finally, interpolate, like a_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.
Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)

Floating-point division - bias to avoid a result less than an 'exact' value

I am currently tightening floating-point numerics for an estimate of a value. (It's: p(k,t) for those who are interested.) Essentially, the utility can never yield an under-estimate of this value: the security of probable prime generation depends on a numerically robust implementation. While output results agree with the published values, I have used the DBL_EPSILON value to ensure that division, in particular, yields a result that is never less than the true value:
Consider: double x, y; /* assigned some values... */
The evaluation: r = x / y; occurs frequently, but these (finite precision) results may truncate significant digits from the true result - a possibly infinite precision rational expansion. I currently try to mitigate this by applying a bias to the numerator, i.e.,
r = ((1.0 + DBL_EPSILON) * x) / y;
If you know anything about this subject, p(k,t) is typically much smaller than most estimates - but it's simply not good enough to dismiss the issue with this "observation". I can of course state:
(((1.0 + DBL_EPSILON) * x) / y) >= (x / y)
Of course, I need to ensure that the 'biased' result is greater than, or equal to, the 'exact' value. While I am certain it has to do with manipulating or scaling DBL_EPSILON, I obviously want the 'biased' result to exceed the 'exact' result by a minimum - demonstrable under IEEE-754 arithmetic assumptions.
Yes, I've looked though Goldberg's paper, and I've searched for a robust solution. Please don't suggest manipulation of rounding modes. Ideally, I'm after an answer by someone with a very good grasp on floating-point theorems, or knows of a very well illustrated example.
EDIT: To clarify, (((1.0 + DBL_EPSILON) * x) / y) or a form (((1.0 + c) * x) / y), is not a prerequisite. This was simply an approach I was using as 'probably good enough', without having provided a solid basis for it. I can state that the numerator and denominator will not be special values: NaNs, Infs, etc., nor will the denominator be zero.
First: I know that you don't want to set the rounding mode, but it really should be said that
in terms of precision, as others have noted, setting the rounding mode will produce as good of an answer as possible. Specifically, assuming that x and y are both positive (which seems to be the case, but hasn't been explicitly stated in your question), the following is a standard C snippet with the desired effect[1]:
#include <math.h>
#pragma STDC FENV_ACCESS on
int OldRoundingMode = fegetround();
fesetround(FE_UPWARD);
r = x/y;
fesetround(OldRoundingMode);
Now, that aside, there are legitimate reasons not to want to change the rounding mode (some platforms don't support round-to-plus-infinity, on some platforms changing the rounding mode introduces a large serializing stall, etc etc), and your desire not to do so shouldn't be brushed aside so casually. So, respecting your question, what else can we do?
If your platform supports fused multiply-add, there's a very elegant solution available to you:
#include <math.h>
r = x/y;
if (fma(r,y,-x) < 0) r = nextafter(r, INFINITY);
On platforms with hardware fma support, this is very efficient. Even if fma( ) is implemented in software, it may be acceptable. This approach has the virtue that it will deliver the same result as would changing the rounding mode; that is, the tightest bound possible.
If your platform's C library is antediluvian and does not provide fma, there is still hope. Your claimed statement is correct (assuming no denormal values, at least -- I would need to think more about what happens for denormals); (1.0+DBL_EPSILON)*x/y really is always greater than or equal to the infinitely precise x/y. It will sometimes be one ulp larger than the smallest value with this property, but that's a very small and probably acceptable margin. The proof of these claims is pretty fussy, and probably not suitable for StackOverflow, but I'll give a quick sketch:
Ignoring denormals, it suffices to restrict ourselves to x, y in [1.0, 2.0).
(1.0 + eps)*x >= x + eps > x. To see this, observe:
(1.0 + eps)*x = x + x*eps >= x + eps > x.
Let P be the mathematically precise x/y. We have:
(1.0 + eps)*x/y >= (x + eps)/y = x/y + eps/y = P + eps/y
Now, y is bounded above by 2, so this gives us:
(1.0 + eps)*x/y > P + eps/2
which is sufficient to guarantee that the result rounds to a value >= P. This also shows us the way to a tighter bound. We could instead use nextafter(x,INFINITY)/y to get the desired effect with a tighter bound in many cases. (nextafter(x,INFINITY) is always x + ulp, whereas (1.0 + eps)*x will be x + 2ulp half of the time. If you want to avoid calling the nextafter library function, you can use (x + (0.75*DBL_EPSILON)*x) instead to get the same result, under the working assumption of positive normal values).
In order to be really pedantically correct, this would become significantly more complicated. No one really writes code like this, but it would be along these lines:
#include <math.h>
#pragma STDC FENV_ACCESS on
#if defined FE_UPWARD
int OldRoundingMode = fegetround();
if (OldRoundingMode < 0) goto Error;
if (fesetround(FE_UPWARD)) goto Error;
r = x/y;
if (fesetround(OldRoundingMode)) goto TrulyHosed;
return r;
TrulyHosed:
// we established the desired rounding mode and did our computation,
// but now we can't set it back to the original mode. I have no idea
// how you handle this gracefully.
Error:
#else
// we can't establish the desired rounding mode, so fall back on
// something else.

Resources