Efficient computation of 2**64 / divisor via fast floating-point reciprocal - c

I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. It requires computation of 264 / divisor, as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, based on the requirements of the following fixed-point iterations. This means this computation needs to provide an underestimate. I currently have the following code, which works well, based on extensive testing:
#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()
uint64_t divisor, recip;
float r, s, t;
t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor
While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:
#include <xmmintrin.h>
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
__m128 t;
float e, r;
t = _mm_set_ss (a);
t = _mm_rcp_ss (t);
_mm_store_ss (&r, t);
e = fmaf (r, -a, 1.0f);
e = fmaf (e, e, e);
r = fmaf (e, r, r);
return r;
}
Implementations of nextafterf() are typically not performance optimized. On platforms where there are means to quickly re-interprete an IEEE 754 binary32 into an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine use of nextafterf() and scaling as follows:
s = int_as_float (float_as_int (r) + 0x1fffffff);
Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types, making this a performance bottleneck.
t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
A portable, but slow, implementation of uint64_to_float_ru uses dynamic changes to FPU rounding mode:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
float uint64_to_float_ru (uint64_t a)
{
float res;
int curr_mode = fegetround ();
fesetround (FE_UPWARD);
res = (float)a;
fesetround (curr_mode);
return res;
}
I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes this computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and follow those up with a 32-bit fixed-point Newton-Raphson step.
Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-) the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.
uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
float r, s, t;
/* compute initial approximation for reciprocal; must be underestimate! */
t = uint64_to_float_ru (divisor);
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
/* perform Halley iteration with cubic convergence to refine reciprocal */
temp = neg_divisor * recip;
temp = umul64hi (temp, temp) + temp;
recip = umul64hi (recip, temp) + recip;
/* compute preliminary quotient and remainder */
quot = umul64hi (dividend, recip);
rem = dividend - divisor * quot;
/* adjust quotient if too small; quotient off by 2 at most */
if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;
/* handle division by zero */
if (divisor == 0ULL) quot = ~0ULL;
return quot;
}
umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:
inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
uint64_t res;
__asm__ (
"movq %1, %%rax;\n\t" // rax = a
"mulq %2;\n\t" // rdx:rax = a * b
"movq %%rdx, %0;\n\t" // res = (a * b)<63:32>
: "=rm" (res)
: "rm"(a), "rm"(b)
: "%rax", "%rdx");
return res;
}

This solution combines two ideas:
You can convert to floating point by simply reinterpreting the bits as floating point and subtracting a constant, so long as the number is within a particular range. So add a constant, reinterpret, and then subtract that constant. This will give a truncated result (which is therefore always less than or equal the desired value).
You can approximate reciprocal by negating both the exponent and the mantissa. This may be achieved by interpreting the bits as int.
Option 1 here only works in a certain range, so we check the range and adjust the constants used. This works in 64 bits because the desired float only has 23 bits of precision.
The result in this code will be double, but converting to float is trivial, and can be done on the bits or directly, depending on hardware.
After this you'd want to do the Newton-Raphson iteration(s).
Much of this code simply converts to magic numbers.
double
u64tod_inv( uint64_t u64 ) {
__asm__( "#annot0" );
union {
double f;
struct {
unsigned long m:52; // careful here with endianess
unsigned long x:11;
unsigned long s:1;
} u64;
uint64_t u64i;
} z,
magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },
magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },
magic2 = { .u64 = { 0, 2046, 0 } };
__asm__( "#annot1" );
if( u64 < (1UL << 52UL ) ) {
z.u64i = u64 + magic0.u64i;
z.f -= magic0.f;
} else {
z.u64i = ( u64 >> 12 ) + magic1.u64i;
z.f -= magic1.f;
}
__asm__( "#annot2" );
z.u64i = magic2.u64i - z.u64i;
return z.f;
}
Compiling this on an Intel core 7 gives a number of instructions (and a branch), but, of course, no multiplies or divides at all. If the casts between int and double are fast this should run pretty quickly.
I suspect float (with only 23 bits of precision) will require more than 1 or 2 Newton-Raphson iterations to get the accuracy you want, but I haven't done the math...

Related

fast double exp2 function in C [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
The community reviewed whether to reopen this question 9 months ago and left it closed:
Original close reason(s) were not resolved
Improve this question
I need a version of the following fast exp2 function working in double precision, can you help me ? Please don't answer saying that it is an approximation so a double version is pointless and casting the result to (double) is enough..thanks. The function which I found somewhere and which works for me is the following and it is much faster than exp2f(), but unfortunately I could not find any double precision version:
inline float fastexp2(float p)
{
if(p<-126.f) p= -126.f;
int w=(int)p;
float z=p-(float)w;
if(p<0.f) z+= 1.f;
union {uint32_t i; float f;} v={(uint32_t)((1<<23)*(p+121.2740575f+27.7280233f/(4.84252568f -z)-1.49012907f * z)) };
return v.f;
}
My assumption is that the existing code from the question assumes IEEE-754 binary floating-point computation, in particular mapping C's float type to IEEE-754's binary32 format.
The existing code suggests that only floating-point results in the normal range are of interest: subnormal results are avoided by clamping the input from below and overflow is ignored. So for float computation valid inputs are in the interval [-126, 128). By exhaustive test, I found that the maximum relative error of the function in the question is 7.16e-5, and that the root-mean-square (RMS) error is 1.36e-5.
My assumption is that the desired change to double computation should widen the range of allowed inputs to [-1022, 1024), and that identical relative accuracy should be maintained. The code is written in a fairly cryptic fashion. So as a first step, I rearranged it into a more readable version. In a second step, I tweaked the coefficients of the core approximation so as to minimize the maximum relative error. This results in the following ISO-C99 code:
/* compute 2**p, for p in [-126, 128). Maximum relative error: 5.04e-5; RMS error: 1.03e-5 */
float fastexp2 (float p)
{
const int FP32_MIN_EXPO = -126; // exponent of minimum binary32 normal
const int FP32_MANT_BITS = 23; // number of stored mantissa (significand) bits
const int FP32_EXPO_BIAS = 127; // binary32 exponent bias
float res;
p = (p < FP32_MIN_EXPO) ? FP32_MIN_EXPO : p; // clamp below
/* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
float w = floorf (p); // integral part
float z = p - w; // fractional part
/* approximate 2**z-1 for z in [0, 1) */
float approx = -0x1.6e7592p+2f + 0x1.bba764p+4f / (0x1.35ed00p+2f - z) - 0x1.f5e546p-2f * z;
/* assemble the exponent and mantissa components into final result */
int32_t resi = ((1 << FP32_MANT_BITS) * (w + FP32_EXPO_BIAS + approx));
memcpy (&res, &resi, sizeof res);
return res;
}
Refactoring and retuning of the coefficients resulted in slight improvements in accuracy, with maximum relative error of 5.04e-5 and RMS error of 1.03e-5. It should be noted that floating-point arithmetic is generally not associative, therefore any re-association of floating-point operations, either by manual code changes or compiler transformations could affect the stated accuracy and requires careful re-testing.
I do not expect any need to modify the code as it compiles into efficient machine code for common architectures, as can be seen from trial compilations with Compiler Explorer, e.g. gcc with x86-64 or gcc with ARM64.
At this point it is obvious what needs to be changed for switching to double computation. Change all instances of float to double, all instances of int32_t to int64_t, change type suffixes for literal constants and math functions, and change the floating-point format specific parameters for IEEE-754 binary32 to those for IEEE-754 binary64. The core approximation needs re-tuning to make best possible use of double precision coefficients in the core approximation.
/* compute 2**p, for p in [-1022, 1024). Maximum relative error: 4.93e-5. RMS error: 9.91e-6 */
double fastexp2 (double p)
{
const int FP64_MIN_EXPO = -1022; // exponent of minimum binary64 normal
const int FP64_MANT_BITS = 52; // number of stored mantissa (significand) bits
const int FP64_EXPO_BIAS = 1023; // binary64 exponent bias
double res;
p = (p < FP64_MIN_EXPO) ? FP64_MIN_EXPO : p; // clamp below
/* 2**p = 2**(w+z), with w an integer and z in [0, 1) */
double w = floor (p); // integral part
double z = p - w; // fractional part
/* approximate 2**z-1 for z in [0, 1) */
double approx = -0x1.6e75d58p+2 + 0x1.bba7414p+4 / (0x1.35eccbap+2 - z) - 0x1.f5e53c2p-2 * z;
/* assemble the exponent and mantissa components into final result */
int64_t resi = ((1LL << FP64_MANT_BITS) * (w + FP64_EXPO_BIAS + approx));
memcpy (&res, &resi, sizeof res);
return res;
}
Both maximum relative error and root-mean-square error decreases very slightly to 4.93e-5 and 9.91e-6, respectively. This is as expected, because for an approximation that is roughly accurate to 15 bits it matters little whether intermediate computation is performed with 24 bits of precision or 53 bits of precision. The computation uses a division, and this tends to be slower for double than for float on all platforms I am familiar with, so the double-precision port doesn't seem to provide any significant advantages other than perhaps eliminating conversion overhead if the calling code uses double-precision computation.

Why do we multiply normalized fraction with 0.5 to get the significand in IEEE 754 representation?

I have a question about the pack754() function defined in Section 7.4 of Beej's Guide to Network Programming.
This function converts a floating point number f into its IEEE 754 representation where bits is the total number of bits to represent the number and expbits is the number of bits used to represent only the exponent.
I am concerned with single-precision floating numbers only, so for this question, bits is specified as 32 and expbits is specified as 8. This implies that 23 bits is used to store the significand (because one bit is sign bit).
My question is about this line of code.
significand = fnorm * ((1LL<<significandbits) + 0.5f);
What is the role of + 0.5f in this code?
Here is a complete code that uses this function.
#include <stdio.h>
#include <stdint.h> // defines uintN_t types
#include <inttypes.h> // defines PRIx macros
uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
long double fnorm;
int shift;
long long sign, exp, significand;
unsigned significandbits = bits - expbits - 1; // -1 for sign bit
if (f == 0.0) return 0; // get this special case out of the way
// check sign and begin normalization
if (f < 0) { sign = 1; fnorm = -f; }
else { sign = 0; fnorm = f; }
// get the normalized form of f and track the exponent
shift = 0;
while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
fnorm = fnorm - 1.0;
// calculate the binary form (non-float) of the significand data
significand = fnorm * ((1LL<<significandbits) + 0.5f);
// get the biased exponent
exp = shift + ((1<<(expbits-1)) - 1); // shift + bias
// return the final answer
return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
int main(void)
{
float f = 3.1415926;
uint32_t fi;
printf("float f: %.7f\n", f);
fi = pack754(f, 32, 8);
printf("float encoded: 0x%08" PRIx32 "\n", fi);
return 0;
}
What purpose does + 0.5f serve in this code?
The code is an incorrect attempt at rounding.
long double fnorm;
long long significand;
unsigned significandbits
...
significand = fnorm * ((1LL<<significandbits) + 0.5f); // bad code
The first clue of incorrectness is the f of 0.5f, which indicates float, is a nonsensical introduction of specifying float in a routine with long double f and fnorm. float math has no application in the function.
Yet adding 0.5f does not mean that the code is limited to float math in (1LL<<significandbits) + 0.5f. See FLT_EVAL_METHOD which may allow higher precision intermediate results and have fooled the code author in testing.
A rounding attempt does make sense as the argument is long double and the target representations are narrower. Adding 0.5 is a common approach - but it is not done right here. IMO, the lack of the author commenting here concerning 0.5f hinted that the intent was "obvious" - not subtle, albeit incorrect.
As commented, moving the 0.5 is closer to being correct for rounding, but may mis-lead some into thinking the addition is done with float math, (it is long double math adding a long doubleproduct to float causes the 0.5f to be promoted to long double first).
// closer to rounding but may mislead
significand = fnorm * (1LL<<significandbits) + 0.5f;
// better
significand = fnorm * (1LL<<significandbits) + 0.5L; // or 0.5l or simply 0.5
To round, without calling the preferred <math.h> rounds routines like rintl(), roundl(), nearbyintl(), llrintl(), adding the explicit type 0.5 is still a weak attempt at rounding. It is weak because it rounds incorrectly with many cases. The +0.5 trick relies on that sum being exact.
Consider
long double product = fnorm * (1LL<<significandbits);
long long significand = product + 0.5; // double rounding?
product + 0.5 itself may go through a rounding before truncation/assignment to long long - in effect double rounding.
Best to use the right tool in the C shed of standard library functions.
significand = llrintl(fnorm * (1ULL<<significandbits));
A corner case remains with this rounding is where significand is now one too great and significand , exp needs adjustment. As well identified by #Nayuki, code has other short-comings too. Also, it fails on -0.0.
The + 0.5f serves no purpose in the code, and may be harmful or misleading.
The expression (1LL<<significandbits) + 0.5f results in a float. But even for the small case of significandbits = 23 for single-precision floating-point, the expression evaluates to (float)(223 + 0.5), which rounds to exactly 223 (round half even).
Replacing + 0.5f with + 0.0f results in the same behavior. Heck, drop that term entirely, because fnorm will cause the right-hand side argument of * to be casted to long double anyway. This would be a better way to rewrite the line: long long significand = fnorm * (long double)(1LL << significandbits);
Side note: This implementation of pack754() handles zero correctly (and collapses negative zero to positive zero), but mishandles subnormal numbers (wrong bits), infinities (infinite loop), and NaN (wrong bits). It's best to not treat it as a reference model function.

Most accurate way to compute asinhf() from log1pf()?

The inverse hyperbolic function asinh() is closely related to the natural logarithm. I am trying to determine the most accurate way to compute asinh() from the C99 standard math function log1p(). For ease of experimentation, I am limiting myself to IEEE-754 single-precision computation right now, that is I am looking at asinhf() and log1pf(). I intend to re-use the exact same algorithm for double precision computation, i.e. asinh() and log1p(), later.
My primary goal is to minimize ulp error, the secondary goal is to minimize the number of incorrectly rounded results, under the constraint that the improved code would at most be minimally slower than the versions posted below. Any incremental improvement to accuracy, say 0.2 ulp, would be welcome. Adding a couple of FMAs (fused multiply-adds) would be fine, on the other hand I am hoping someone could identify a solution which employs a fast rsqrtf() (reciprocal square root).
The resulting C99 code should lend itself to vectorization, possibly by some minor straightforward transformations. All intermediate computation must occur at the precision of the function argument and result, as any switch to higher precision may have a severe negative performance impact. The code must work correctly both with IEEE-754 denormal support and in FTZ (flush to zero) mode.
So far, I have identified the following two candidate implementations. Note that the code may be easily transformed into a branchless vectorizable version with a single call to log1pf(), but I have not done so at this stage to avoid unnecessary obfuscation.
/* for a >= 0, asinh(a) = log (a + sqrt (a*a+1))
= log1p (a + (sqrt (a*a+1) - 1))
= log1p (a + sqrt1pm1 (a*a))
= log1p (a + (a*a / (1 + sqrt(a*a + 1))))
= log1p (a + a * (a / (1 + sqrt(a*a + 1))))
= log1p (fma (a / (1 + sqrt(a*a + 1)), a, a)
= log1p (fma (1 / (1/a + sqrt(1/a*a + 1)), a, a)
*/
float my_asinhf (float a)
{
float fa, t;
fa = fabsf (a);
#if !USE_RECIPROCAL
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
t = log1pf (t);
}
#else // USE_RECIPROCAL
if (fa > 0x1.0p126f) { // prevent underflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = 1.0f / fa;
t = fmaf (1.0f / (t + sqrtf (fmaf (t, t, 1.0f))), fa, fa);
t = log1pf (t);
}
#endif // USE_RECIPROCAL
return copysignf (t, a); // restore sign
}
With a particular log1pf() implementation that is accurate to < 0.6 ulps, I am observing the following error statistics when testing exhaustively across all 232 possible IEEE-754 single-precision inputs. When USE_RECIPROCAL = 0, the maximum error is 1.49486 ulp, and there are 353,587,822 incorrectly rounded results. With USE_RECIPROCAL = 1, the maximum error is 1.50805 ulp, and there are only 77,569,390 incorrectly rounded results.
In terms of performance, the variant USE_RECIPROCAL = 0 will be faster if reciprocals and full divisions take roughly the same amount of time, but the variant USE_RECIPROCAL = 1 could be faster if very fast reciprocal support is available.
Answers can assume that all basic arithmetic, including FMA (fused multiply-add) is correctly rounded according to IEEE-754 round-to-nearest-or-even mode. In addition, faster, nearly correctly rounded, versions of reciprocal and rsqrtf() may be available, where "nearly correctly rounded" means the maximum ulp error will be limited to something like 0.53 ulps and the overwhelming majority of results, say > 95%, are correctly rounded. Basic arithmetic with directed roundings may be available at no additional cost to performance.
Firstly, you may want to look into the accuracy and speed of your log1pf function: these can vary a bit between libms (I've found the OS X math functions to be fast, the glibc ones to be slower but typically correctly rounded).
Openlibm, based on the BSD libm, which in turn is based on Sun's fdlibm, use multiple approaches by range, but the main bit is the relation:
t = x*x;
w = log1pf(fabsf(x)+t/(one+sqrtf(one+t)));
You may also want to try compiling with the -fno-math-errno option, which disables the old System V error codes for sqrt (IEEE-754 exceptions will still work).
After various additional experiments, I have convinced myself that a simple argument transformation that does not use higher precision than the argument and result cannot achieve a tighter error bound than the one achieved by the first variant in the code I posted.
Since my question is about minimizing the error for the argument transformation which is incurred in addition to the error in log1pf() itself, the most straightforward approach to use for experimentation is to utilize a correctly rounded implementation of that logarithm function. Note that a correctly-rounded implementation is highly unlikely to exist in the context of a high-performance environment. According to the works of J.-M. Muller et. al., to produce accurate single-precision results, x86 extended precision computation should be sufficient, for example
float accurate_log1pf (float a)
{
float res;
__asm fldln2;
__asm fld dword ptr [a];
__asm fyl2xp1;
__asm fst dword ptr [res];
__asm fcompp;
return res;
}
An implementation of asinhf() using the first variant from my question then looks as follows:
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
t = fmaf (fa / (1.0f + sqrtf (fmaf (fa, fa, 1.0f))), fa, fa);
t = accurate_log1pf (t);
}
return copysignf (t, a); // restore sign
}
Testing with all 232 IEEE-754 single-precision operands shows that the maximum error of 1.49486070 ulp occurs at ±0x1.ff5022p-9 and there are 353,521,140 incorrectly rounded results. What happens if the entire argument transformation uses double-precision arithmetic? The code changes to
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
double tt = fa;
tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
t = (float)tt;
t = accurate_log1pf (t);
}
return copysignf (t, a); // restore sign
}
However, the error bound does not improve with this change! The maximum error of 1.49486070 ulp still occurs at ±0x1.ff5022p-9 and there are now 350,971,046 incorrectly rounded results, slightly fewer than before. The issue seems to be that a float operand cannot convey enough information to log1pf() to produce more accurate results. A similar problem occurs when computing sinf() and cosf(). If the reduced argument, represented as a correctly rounded float operand, is passed to the core polynomials, the resulting error in sinf() and cosf() is just a tad under 1.5 ulp, just as we are observing here with my_asinhf().
One solution is to compute the transformed argument to higher than single precision, for example as a double-float operand pair (a useful brief overwiew of double-float techniques can be found in this paper by Andrew Thall). In this case, we can use the additional information to perform linear interpolation on the result, based on the knowledge that the derivative of the logarithm is the reciprocal. This gives us:
float my_asinhf (float a)
{
float fa, s, t;
fa = fabsf (a);
if (fa >= 0x1.0p64f) { // prevent overflow in intermediate computation
t = log1pf (fa) + 0x1.62e430p-1f; // log(2)
} else {
double tt = fa;
tt = fma (tt / (1.0 + sqrt (fma (tt, tt, 1.0))), tt, tt);
t = (float)tt; // "head" of double-float
s = (float)(tt - (double)t); // "tail" of double-float
t = fmaf (s, 1.0f / (1.0f + t), accurate_log1pf (t)); // interpolate
}
return copysignf (t, a); // restore sign
}
Exhaustive test of this version indicates that the maximum error has been reduced to 0.99999948 ulp, it occurs at ±0x1.deeea0p-22. There are 349,653,534 incorrectly rounded results. A faithfully-rounded implementation of asinhf() has been achieved.
Unfortunately, the practical utility of this result is limited. Depending on HW platform, the throughput of arithmetic operations on double may only be 1/2 to 1/32 of the throughput of float operations. The double-precision computation can be replaced with double-float computation, but this would incur very significant cost as well. Lastly, my approach here was to use the single-precision implementation as a proving ground for subsequent double-precision work, and many hardware platforms (certainly all the ones I am interested in) do not offer hardware support for a numeric format with higher precision than IEEE-754 binary64 (double precision). Therefore any solution should not require higher-precision arithmetic in intermediate computation.
Since all the troublesome arguments in the case of asinhf() are small in magnitude, one could [partially?] address the accuracy issue by using a polynomial minimax approximation for the region around the origin. As this would create another code branch, it would likely make vectorization more difficult.

fast square root optimization?

If you check this very nice page:
http://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi
You'll see this program:
#define SQRT_MAGIC_F 0x5f3759df
float sqrt2(const float x)
{
const float xhalf = 0.5f*x;
union // get bits for floating value
{
float x;
int i;
} u;
u.x = x;
u.i = SQRT_MAGIC_F - (u.i >> 1); // gives initial guess y0
return x*u.x*(1.5f - xhalf*u.x*u.x);// Newton step, repeating increases accuracy
}
My question is: Is there any particular reason why this isn't implemented as:
#define SQRT_MAGIC_F 0x5f3759df
float sqrt2(const float x)
{
union // get bits for floating value
{
float x;
int i;
} u;
u.x = x;
u.i = SQRT_MAGIC_F - (u.i >> 1); // gives initial guess y0
const float xux = x*u.x;
return xux*(1.5f - .5f*xux*u.x);// Newton step, repeating increases accuracy
}
As, from disassembly, I see one MUL less. Is there any purpose to having xhalf appear at all?
It could be that legacy floating point math, which used 80 bit registers, was more accurate when the multipliers where linked together in the last line as intermediate results where kept in 80 bit registers.
The first multiplication in the upper implementation takes place in parallel to the integer math that follows, they use different execution resources.
The second function on the other hand looks faster but it's hard to tell if it really is because of the above.
Also, the const float xux = x*u.x; statement reduces the result back to 32 bit float, which may reduce overall accuracy.
You could test these functions head to head and compare them to the sqrt function in math.h (use double not float). This way you can see which is faster and which is more accurate.

Implementing single-precision division as double-precision multiplication

Question
For a C99 compiler implementing exact IEEE 754 arithmetic, do values of f, divisor of type float exist such that f / divisor != (float)(f * (1.0 / divisor))?
EDIT: By “implementing exact IEEE 754 arithmetic” I mean a compiler that rightfully defines FLT_EVAL_METHOD as 0.
Context
A C compiler that provides IEEE 754-compliant floating-point can only replace a single-precision division by a constant by a single-precision multiplication by the inverse if said inverse is itself representable exactly as a float.
In practice, this only happens for powers of two. So a programmer, Alex, may be confident that f / 2.0f will be compiled as if it had been f * 0.5f, but if it is acceptable for Alex to multiply by 0.10f instead of dividing by 10, Alex should express it by writing the multiplication in the program, or by using a compiler option such as GCC's -ffast-math.
This question is about transforming a single-precision division into a double-precision multiplication. Does it always produce the correctly rounded result? Is there a chance that it could be cheaper, and thus be an optimization that compilers might make (even without -ffast-math)?
I have compared (float)(f * 0.10) and f / 10.0f for all single-precision values of f between 1 and 2, without finding any counter-example. This should cover all divisions of normal floats producing a normal result.
Then I generalized the test to all divisors with the program below:
#include <float.h>
#include <math.h>
#include <stdio.h>
int main(void){
for (float divisor = 1.0; divisor != 2.0; divisor = nextafterf(divisor, 2.0))
{
double factor = 1.0 / divisor; // double-precision inverse
for (float f = 1.0; f != 2.0; f = nextafterf(f, 2.0))
{
float cr = f / divisor;
float opt = f * factor; // double-precision multiplication
if (cr != opt)
printf("For divisor=%a, f=%a, f/divisor=%a but (float)(f*factor)=%a\n",
divisor, f, cr, opt);
}
}
}
The search space is just large enough to make this interesting (246). The program is currently running. Can someone tell me whether it will print something, perhaps with an explanation why or why not, before it has finished?
Your program won't print anything, assuming round-ties-to-even rounding mode. The essence of the argument is as follows:
We're assuming that both f and divisor are between 1.0 and 2.0. So f = a / 2^23 and divisor = b / 2^23 for some integers a and b in the range [2^23, 2^24). The case divisor = 1.0 isn't interesting, so we can further assume that b > 2^23.
The only way that (float)(f * (1.0 / divisor)) could give the wrong result would be for the exact value f / divisor to be so close to a halfway case (i.e., a number exactly halfway between two single-precision floats) that the accumulated errors in the expression f * (1.0 / divisor) push us to the other side of that halfway case from the true value.
But that can't happen. For simplicity, let's first assume that f >= divisor, so that the exact quotient is in [1.0, 2.0). Now any halfway case for single precision in the interval [1.0, 2.0) has the form c / 2^24 for some odd integer c with 2^24 < c < 2^25. The exact value of f / divisor is a / b, so the absolute value of the difference f / divisor - c / 2^24 is bounded below by 1 / (2^24 b), so is at least 1 / 2^48 (since b < 2^24). So we're more than 16 double-precision ulps away from any halfway case, and it should be easy to show that the error in the double precision computation can never exceed 16 ulps. (I haven't done the arithmetic, but I'd guess it's easy to show an upper bound of 3 ulps on the error.)
So f / divisor can't be close enough to a halfway case to create problems. Note that f / divisor can't be an exact halfway case, either: since c is odd, c and 2^24 are relatively prime, so the only way we could have c / 2^24 = a / b is if b is a multiple of 2^24. But b is in the range (2^23, 2^24), so that's not possible.
The case where f < divisor is similar: the halfway cases then have the form c / 2^25 and the analogous argument shows that abs(f / divisor - c / 2^25) is greater than 1 / 2^49, which again gives us a margin of 16 double-precision ulps to play with.
It's certainly not possible if non-default rounding modes are possible. For example, in replacing 3.0f / 3.0f with 3.0f * C, a value of C less than the exact reciprocal would yield the wrong result in downward or toward-zero rounding modes, whereas a value of C greater than the exact reciprocal would yield the wrong result for upward rounding mode.
It's less clear to me whether what you're looking for is possible if you restrict to default rounding mode. I'll think about it and revise this answer if I come up with anything.
Random search resulted in an example.
Looks like when the result is a "denormal/subnormal" number, the inequality is possible. But then, maybe my platform is not IEEE 754 compliant?
f 0x1.7cbff8p-25
divisor -0x1.839p+116
q -0x1.f8p-142
q2 -0x1.f6p-142
int MyIsFinite(float f) {
union {
float f;
unsigned char uc[sizeof (float)];
unsigned long ul;
} x;
x.f = f;
return (x.ul & 0x7F800000L) != 0x7F800000L;
}
float floatRandom() {
union {
float f;
unsigned char uc[sizeof (float)];
} x;
do {
size_t i;
for (i=0; i<sizeof(x.uc); i++) x.uc[i] = rand();
} while (!MyIsFinite(x.f));
return x.f;
}
void testPC() {
for (;;) {
volatile float f, divisor, q, qd;
do {
f = floatRandom();
divisor = floatRandom();
q = f / divisor;
} while (!MyIsFinite(q));
qd = (float) (f * (1.0 / divisor));
if (qd != q) {
printf("%a %a %a %a\n", f, divisor, q, qd);
return;
}
}
}
Eclipse PC Version: Juno Service Release 2
Build id: 20130225-0426

Resources