Related: Follow-up question for IEEE 754 conformant sqrt() implementation for double type.
Context: Need to implement IEEE 754 conformant sqrtf() taking into account the following HW restrictions and usage limitations:
The HW provides a special instruction qseed.f to get an approximation of the reciprocal of the square root (the accuracy of the result is no less than 6.75 bits, and therefore always within ±1% of the accurate result).
Single precision FP:
a. Support by HW (SP FPU): has support;
b. Support by SW (library): has support;
c. Support of subnormal numbers: no support (FLT_HAS_SUBNORM is 0).
Double precision FP:
a. Support by HW (DP FPU): no support;
b. Support by SW (library): has support;
c. Support of subnormal numbers: no support (DBL_HAS_SUBNORM is 0).
I've found one presentation by John Harrison and ended up with this implementation (note that here qseed.f is replaced by rsqrtf()):
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
// https://github.com/nickzman/hyperspace/blob/master/frsqrt.hh
#if 1
float rsqrtf ( float x )
{
const float xhalf = 0.5f * x;
int i = *(int*) & x;
i = 0x5f375a86 - ( i >> 1 );
x = *(float*) & i;
x = x * ( 1.5f - xhalf * x * x );
x = x * ( 1.5f - xhalf * x * x );
x = x * ( 1.5f - xhalf * x * x );
return x;
}
#else
float rsqrtf ( float x )
{
return 1.0f / sqrtf( x );
}
#endif
float sqrtfr_jh( float x, float r )
{
/*
* John Harrison, Formal Verification Methods 5: Floating Point Verification,
* Intel Corporation, 12 December 2002, document name: slides5.pdf, page 14,
* slide "The square root algorithm".
* URL: https://www.cl.cam.ac.uk/~jrh13/slides/anu-09_12dec02/slides5.pdf
*/
double rd, b, z0, s0, d, k, h0, e, t0, s1, c, d1, h1, s;
static const double half = 0.5;
static const double one = 1.0;
static const double three = 3.0;
static const double two = 2.0;
rd = (double)r;
b = half * x;
z0 = rd * rd;
s0 = x * rd;
d = fma( -b, z0, half );
k = fma( x, rd, -s0 );
h0 = half * rd;
e = fma( three / two, d, one );
t0 = fma( d, s0, k );
s1 = fma( e, t0, s0 );
c = fma( d, e, one );
d1 = fma( -s1, s1, x );
h1 = c * h0;
s = fma( d1, h1, s1 );
return (float)s;
}
float my_sqrtf( float x )
{
/* handle special cases */
if (x == 0) {
return x + x;
}
/* handle normal cases */
if ((x > 0) && (x < INFINITY)) {
return sqrtfr_jh( x, rsqrtf( x ) );
}
/* handle special cases */
return (x < 0) ? NAN : (x + x);
}
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
const uint64_t N = 10000000000ULL; /* desired number of test cases */
float arg, ref, res;
uint64_t argi64;
uint32_t refi, resi;
uint64_t count = 0;
float spec[] = {0.0f, 1.0f, INFINITY, NAN};
printf ("test a few special cases:\n");
for (int i = 0; i < sizeof (spec)/sizeof(spec[0]); i++) {
printf ("my_sqrt(%a) = %a\n", spec[i], my_sqrtf(spec[i]));
printf ("my_sqrt(%a) = %a\n", -spec[i], my_sqrtf(-spec[i]));
}
printf ("test %llu random cases:\n", (unsigned long long)N);
do {
argi64 = KISS64;
memcpy (&arg, &argi64, sizeof arg);
if ( fpclassify(arg) == FP_SUBNORMAL )
{
continue;
}
++count;
res = my_sqrtf (arg);
ref = sqrtf (arg);
memcpy (&resi, &res, sizeof resi);
memcpy (&refi, &ref, sizeof refi);
if ( ! ( isnan(res) && isnan(ref) ) )
if (resi != refi) {
printf ("\rerror # arg=%a (%e)\n", arg, arg);
printf ("\rerror # res=%a (%e)\n", res, res);
printf ("\rerror # ref=%a (%e)\n", ref, ref);
return EXIT_FAILURE;
}
if ((count & 0xfffff) == 0) printf ("\r[%llu]", (unsigned long long)count);
} while (count < N);
printf ("\r[%llu]", (unsigned long long)count);
printf ("\ntests PASSED\n");
return EXIT_SUCCESS;
}
And it seems to work correctly (at least for some random cases): it reports:
[10000000000]
tests PASSED
Now the question: since the original John Harrison sqrtf() algorithm uses only single-precision computations (i.e. type float), is it possible to reduce the number of operations when using only double-precision computations (i.e. type double, except for conversions) and still be IEEE 754 conformant?
P.S. Since users #njuffa and #chux - Reinstate Monica are strong in FP, I invite them to participate. However, all users competent in FP are welcome.
Computing a single-precision square root via double-precision code is going to be inefficient, especially if the hardware provides no native double-precision operations.
The following assumes hardware that conforms to IEEE-754 (2008), except that subnormals are not supported and flushed to zero. Fused-multiply add (FMA) is supported. It further assumes an ISO-C99 compiler that maps float to IEEE-754 binary32, and that maps the hardware's single-precision FMA instruction to the standard math function fmaf().
From a hardware starting approximation for the reciprocal square root with a maximum relative error of 2^-6.75, one can get to a reciprocal square root accurate to 1 single-precision ulp with two Newton-Raphson iterations. Multiplying this with the original argument provides an accurate estimate of the square root. The square of this approximation is subtracted from the original argument to compute the approximation error for the square root. This error is then used to apply a correction to the square root approximation, resulting in a correctly-rounded square root.
However, this straightforward algorithm breaks down for arguments that are very small, due to underflow or overflow in intermediate computations, in particular when the underlying arithmetic operates in flush-to-zero mode, which flushes subnormals to zero. For such arguments we can construct slowpath code that scales the input towards unity and scales back the result accordingly once the square root has been computed. Code for handling special operands such as zeros, infinities, NaNs, and negative arguments other than zero is also added to this slowpath code.
The NaN generated by the slowpath code for invalid operations should be adjusted to match the system's existing operations. For example, for x86-based systems this would be a special QNaN called INDEFINITE, with a bit pattern of 0xffc00000, while for a GPU running CUDA it would be the canonical single-precision NaN with a bit pattern of 0x7fffffff.
For performance reasons it may be useful to inline the fastpath code while making the slowpath code a called outlined subroutine. Single-precision math functions with a single argument should always be tested exhaustively against a "golden" reference implementation, which takes just minutes on modern hardware.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float uint32_as_float (uint32_t);
uint32_t float_as_uint32 (float);
float qseedf (float);
float sqrtf_slowpath (float);
/* Square root computation for IEEE-754 binary32 mapped to 'float' */
float my_sqrtf (float arg)
{
const uint32_t upper = float_as_uint32 (0x1.fffffep+127f);
const uint32_t lower = float_as_uint32 (0x1.000000p-102f);
float rsq, sqt, err;
/* use fastpath computation if argument in [0x1.0p-102, 0x1.0p+128) */
if ((float_as_uint32 (arg) - lower) <= (upper - lower)) {
/* generate low-accuracy approximation to rsqrt(arg) */
rsq = qseedf (arg);
/* apply two Newton-Raphson iterations with quadratic convergence */
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
/* compute sqrt from rsqrt, round result to nearest or even */
sqt = rsq * arg;
err = fmaf (sqt, -sqt, arg);
sqt = fmaf (0.5f * rsq, err, sqt);
} else {
sqt = sqrtf_slowpath (arg);
}
return sqt;
}
/* reinterpret bit pattern of 32-bit unsigned integer as IEEE-754 binary32 */
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
/* reinterpret bit pattern of IEEE-754 binary32 as a 32-bit unsigned integer */
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
/* simulate low-accuracy hardware approximation to 1/sqrt(a) */
float qseedf (float a)
{
float r = 1.0f / sqrtf (a);
r = uint32_as_float (float_as_uint32 (r) & ~0x1ffff);
return r;
}
/* square root computation suitable for all IEEE-754 binary32 arguments */
float sqrtf_slowpath (float arg)
{
const float FP32_INFINITY = uint32_as_float (0x7f800000);
const float FP32_QNAN = uint32_as_float (0xffc00000); /* system specific */
const float scale_in = 0x1.0p+26f;
const float scale_out = 0x1.0p-13f;
float rsq, err, sqt;
if (arg < 0.0f) {
return FP32_QNAN;
} else if ((arg == 0.0f) || !(fabsf (arg) < FP32_INFINITY)) { /* Inf, NaN */
return arg + arg;
} else {
/* scale subnormal arguments towards unity */
arg = arg * scale_in;
/* generate low-accuracy approximation to rsqrt(arg) */
rsq = qseedf (arg);
/* apply two Newton-Raphson iterations with quadratic convergence */
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
/* compute sqrt from rsqrt, round to nearest or even */
sqt = rsq * arg;
err = fmaf (sqt, -sqt, arg);
sqt = fmaf (0.5f * rsq, err, sqt);
/* compensate scaling of argument by counter-scaling the result */
sqt = sqt * scale_out;
return sqt;
}
}
int main (void)
{
uint32_t ai, resi, refi;
float a, res, reff;
double ref;
ai = 0x00000000;
do {
a = uint32_as_float (ai);
res = my_sqrtf (a);
ref = sqrt ((double)a);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
if (resi != refi) {
printf ("error # %08x %15.8e res=%08x %15.8e ref=%08x %15.8e\n",
ai, a, resi, res, refi, reff);
return EXIT_FAILURE;
}
ai++;
} while (ai);
return EXIT_SUCCESS;
}
I'm looking for an approximation of the natural exponential function operating on SSE vectors, namely __m128 exp( __m128 x ).
I have an implementation which is quick but seems to be very low in accuracy:
static inline __m128 FastExpSse(__m128 x)
{
__m128 a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
// fast exponential function, x should be in [-87, 87]
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
Does anybody have an implementation with better accuracy that is just as fast (or faster)?
I'd be happy if it is written in C Style.
Thank You.
The C code below is a translation into SSE intrinsics of an algorithm I used in a previous answer to a similar question.
The basic idea is to transform the computation of the standard exponential function into computation of a power of 2: expf (x) = exp2f (x / logf (2.0f)) = exp2f (x * 1.44269504). We split t = x * 1.44269504 into an integer i and a fraction f, such that t = i + f and 0 <= f <= 1. We can now compute 2^f with a polynomial approximation, then scale the result by 2^i by adding i to the exponent field of the single-precision floating-point result.
One problem that exists with an SSE implementation is that we want to compute i = floorf (t), but there is no fast way to compute the floor() function. However, we observe that for positive numbers, floor(x) == trunc(x), and that for negative numbers, floor(x) == trunc(x) - 1, except when x is a negative integer. However, since the core approximation can handle an f value of 1.0f, using the approximation for negative arguments is harmless. SSE provides an instruction to convert single-precision floating point operands to integers with truncation, so this solution is efficient.
Peter Cordes points out that SSE4.1 supports a fast floor function _mm_floor_ps(), so a variant using SSE4.1 is also shown below. Not all toolchains automatically predefine the macro __SSE4_1__ when SSE 4.1 code generation is enabled, but gcc does.
Compiler Explorer (Godbolt) shows that gcc 7.2 compiles the code below into sixteen instructions for plain SSE and twelve instructions for SSE 4.1.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <emmintrin.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif
/* max. rel. error = 1.72863156e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
__m128 t, f, e, p, r;
__m128i i, j;
__m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */
__m128 c0 = _mm_set1_ps (0.3371894346f);
__m128 c1 = _mm_set1_ps (0.657636276f);
__m128 c2 = _mm_set1_ps (1.00172476f);
/* exp(x) = 2^i * 2^f; i = floor (log2(e) * x), 0 <= f <= 1 */
t = _mm_mul_ps (x, l2e); /* t = log2(e) * x */
#ifdef __SSE4_1__
e = _mm_floor_ps (t); /* floor(t) */
i = _mm_cvtps_epi32 (e); /* (int)floor(t) */
#else /* __SSE4_1__*/
i = _mm_cvttps_epi32 (t); /* i = (int)t */
j = _mm_srli_epi32 (_mm_castps_si128 (x), 31); /* signbit(t) */
i = _mm_sub_epi32 (i, j); /* (int)t - signbit(t) */
e = _mm_cvtepi32_ps (i); /* floor(t) ~= (int)t - signbit(t) */
#endif /* __SSE4_1__*/
f = _mm_sub_ps (t, e); /* f = t - floor(t) */
p = c0; /* c0 */
p = _mm_mul_ps (p, f); /* c0 * f */
p = _mm_add_ps (p, c1); /* c0 * f + c1 */
p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */
p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= 2^f */
j = _mm_slli_epi32 (i, 23); /* i << 23 */
r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
return r;
}
int main (void)
{
union {
float f[4];
unsigned int i[4];
} arg, res;
double relerr, maxrelerr = 0.0;
int i, j;
__m128 x, y;
float start[2] = {-0.0f, 0.0f};
float finish[2] = {-87.33654f, 88.72283f};
for (i = 0; i < 2; i++) {
arg.f[0] = start[i];
arg.i[1] = arg.i[0] + 1;
arg.i[2] = arg.i[0] + 2;
arg.i[3] = arg.i[0] + 3;
do {
memcpy (&x, &arg, sizeof(x));
y = fast_exp_sse (x);
memcpy (&res, &y, sizeof(y));
for (j = 0; j < 4; j++) {
double ref = exp ((double)arg.f[j]);
relerr = fabs ((res.f[j] - ref) / ref);
if (relerr > maxrelerr) {
printf ("arg=% 15.8e res=%15.8e ref=%15.8e err=%15.8e\n",
arg.f[j], res.f[j], ref, relerr);
maxrelerr = relerr;
}
}
arg.i[0] += 4;
arg.i[1] += 4;
arg.i[2] += 4;
arg.i[3] += 4;
} while (fabsf (arg.f[3]) < fabsf (finish[i]));
}
printf ("maximum relative error = %15.8e\n", maxrelerr);
return EXIT_SUCCESS;
}
An alternative design for fast_exp_sse() extracts the integer portion of the adjusted argument x / log(2) in round-to-nearest mode, using the well-known technique of adding the "magic" conversion constant 1.5 * 2^23 to force rounding in the correct bit position, then subtracting out the same number again. This requires that the SSE rounding mode in effect during the addition is "round to nearest or even", which is the default. wim pointed out in comments that some compilers may optimize out the addition and subtraction of the conversion constant cvt as redundant when aggressive optimization is used, interfering with the functionality of this code sequence, so it is recommended to inspect the generated machine code. The approximation interval for the computation of 2^f is now centered around zero, since -0.5 <= f <= 0.5, requiring a different core approximation.
/* max. rel. error <= 1.72860465e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
__m128 t, f, p, r;
__m128i i, j;
const __m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */
const __m128 cvt = _mm_set1_ps (12582912.0f); /* 1.5 * (1 << 23) */
const __m128 c0 = _mm_set1_ps (0.238428936f);
const __m128 c1 = _mm_set1_ps (0.703448006f);
const __m128 c2 = _mm_set1_ps (1.000443142f);
/* exp(x) = 2^i * 2^f; i = rint (log2(e) * x), -0.5 <= f <= 0.5 */
t = _mm_mul_ps (x, l2e); /* t = log2(e) * x */
r = _mm_sub_ps (_mm_add_ps (t, cvt), cvt); /* r = rint (t) */
f = _mm_sub_ps (t, r); /* f = t - rint (t) */
i = _mm_cvtps_epi32 (t); /* i = (int)t */
p = c0; /* c0 */
p = _mm_mul_ps (p, f); /* c0 * f */
p = _mm_add_ps (p, c1); /* c0 * f + c1 */
p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */
p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= exp2(f) */
j = _mm_slli_epi32 (i, 23); /* i << 23 */
r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
return r;
}
The algorithm for the code in the question appears to be taken from the work of Nicol N. Schraudolph, which cleverly exploits the semi-logarithmic nature of IEEE-754 binary floating-point formats:
N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp.853-862.
After removal of the argument clamping code, it reduces to just three SSE instructions. The "magical" correction constant 486411 is not optimal for minimizing maximum relative error over the entire input domain. Based on simple binary search, the value 298765 seems to be superior, reducing maximum relative error for FastExpSse() to 3.56e-2 vs. maximum relative error of 1.73e-3 for fast_exp_sse().
/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
__m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
__m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
__m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
return _mm_castsi128_ps (t);
}
Schraudolph's algorithm basically uses the linear approximation 2^f ~= 1.0 + f for f in [0,1], and its accuracy could be improved by adding a quadratic term. The clever part of Schraudolph's approach is computing 2^i * 2^f without explicitly separating the integer portion i = floor(x * 1.44269504) from the fraction. I see no way to extend that trick to a quadratic approximation, but one can certainly combine the floor() computation from Schraudolph with the quadratic approximation used above:
/* max. rel. error <= 1.72886892e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
__m128 f, p, r;
__m128i t, j;
const __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
const __m128i m = _mm_set1_epi32 (0xff800000); /* mask for integer bits */
const __m128 ttm23 = _mm_set1_ps (1.1920929e-7f); /* exp2(-23) */
const __m128 c0 = _mm_set1_ps (0.3371894346f);
const __m128 c1 = _mm_set1_ps (0.657636276f);
const __m128 c2 = _mm_set1_ps (1.00172476f);
t = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
j = _mm_and_si128 (t, m); /* j = (int)(floor (x/log(2))) << 23 */
t = _mm_sub_epi32 (t, j);
f = _mm_mul_ps (ttm23, _mm_cvtepi32_ps (t)); /* f = (x/log(2)) - floor (x/log(2)) */
p = c0; /* c0 */
p = _mm_mul_ps (p, f); /* c0 * f */
p = _mm_add_ps (p, c1); /* c0 * f + c1 */
p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */
p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= 2^f */
r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
return r;
}
A good increase in accuracy in my algorithm (implementation FastExpSse in the answer above) can be obtained at the cost of an integer subtraction and floating-point division by using FastExpSse(x/2)/FastExpSse(-x/2) instead of FastExpSse(x). The trick here is to set the shift parameter (298765 above) to zero so that the piecewise linear approximations in the numerator and denominator line up to give you substantial error cancellation. Roll it into a single function:
__m128 BetterFastExpSse (__m128 x)
{
const __m128 a = _mm_set1_ps ((1 << 22) / (float)M_LN2); // to get exp(x/2)
const __m128i b = _mm_set1_epi32 (127 * (1 << 23)); // NB: zero shift!
__m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
__m128i s = _mm_add_epi32 (b, r);
__m128i t = _mm_sub_epi32 (b, r);
return _mm_div_ps (_mm_castsi128_ps (s), _mm_castsi128_ps (t));
}
(I'm not a hardware guy - how bad a performance killer is the division here?)
If you need exp(x) just to get y = tanh(x) (e.g. for neural networks), use FastExpSse with zero shift as follows:
a = FastExpSse(x);
b = FastExpSse(-x);
y = (a - b)/(a + b);
to get the same type of error cancellation benefit. The logistic function works similarly, using FastExpSse(x/2)/(FastExpSse(x/2) + FastExpSse(-x/2)) with zero shift. (This is just to show the principle - you obviously don't want to evaluate FastExpSse multiple times here, but roll it into a single function along the lines of BetterFastExpSse above.)
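Rolled into a single function along those lines, the zero-shift tanh might look like the sketch below (FastTanhSse is just an illustrative name; the constants are the same as in FastExpSse with the shift parameter set to zero):

```c
#include <emmintrin.h>

__m128 FastTanhSse (__m128 x)
{
    const __m128  a = _mm_set1_ps (12102203.0f);        /* (1 << 23) / log(2) */
    const __m128i b = _mm_set1_epi32 (127 * (1 << 23)); /* NB: zero shift */
    __m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
    __m128 p = _mm_castsi128_ps (_mm_add_epi32 (b, r)); /* ~ exp( x) */
    __m128 q = _mm_castsi128_ps (_mm_sub_epi32 (b, r)); /* ~ exp(-x) */
    return _mm_div_ps (_mm_sub_ps (p, q), _mm_add_ps (p, q));
}
```

In spot checks this stays within about 0.01 of tanhf() for moderate arguments; as with BetterFastExpSse, the division is the price paid for the error cancellation.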
I did develop a series of higher-order approximations from this, ever more accurate but also slower. Unpublished but happy to collaborate if anyone wants to give them a spin.
And finally, for some fun: use in reverse gear to get FastLogSse. Chaining that with FastExpSse gives you both operator and error cancellation, and out pops a blazingly fast power function...
Going back through my notes from way back then, I did explore ways to improve the accuracy without using division. I used the same reinterpret-as-float trick but applied a polynomial correction to the mantissa which was essentially calculated in 16-bit fixed-point arithmetic (the only way to do it fast back then).
The cubic and quartic versions give you 4 and 5 significant digits of accuracy, respectively. There was no point in increasing the order beyond that, as the noise of the low-precision arithmetic then starts to drown out the error of the polynomial approximation. Here are the plain C versions:
#include <stdint.h>
float fastExp3(register float x) // cubic spline approximation
{
union { float f; int32_t i; } reinterpreter;
reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
int32_t m = (reinterpreter.i >> 7) & 0xFFFF; // copy mantissa
// empirical values for small maximum relative error (8.34e-5):
reinterpreter.i +=
((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
return reinterpreter.f;
}
float fastExp4(register float x) // quartic spline approximation
{
union { float f; int32_t i; } reinterpreter;
reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
int32_t m = (reinterpreter.i >> 7) & 0xFFFF; // copy mantissa
// empirical values for small maximum relative error (1.21e-5):
reinterpreter.i += (((((((((((3537*m) >> 16)
+ 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
return reinterpreter.f;
}
The quartic one satisfies fastExp4(0.0f) == 1.0f exactly, which can be important for fixed-point iteration algorithms.
How efficient are these integer multiply-shift-add sequences in SSE? On architectures where float arithmetic is just as fast, one could use that instead, reducing the arithmetic noise. This would essentially yield cubic and quartic extensions of #njuffa's answer above.
There is a paper about creating fast versions of these equations (tanh, cosh, artanh, sinh, etc):
http://ijeais.org/wp-content/uploads/2018/07/IJAER180702.pdf
"Creating a Compiler Optimized Inlineable Implementation of Intel Svml Simd Intrinsics"
Their tanh equation 6 on page 9 is very similar to #NicSchraudolph's answer.
For softmax use, I'm envisioning the flow as:
auto a = _mm_mul_ps(x, _mm_set1_ps(12102203.2f));
auto b = _mm_castsi128_ps(_mm_cvtps_epi32(a)); // so far as in other variants
// copy 9 MSB from 0x3f800000 over 'b' so that 1 <= c < 2
// - also 1 <= poly_eval(...) < 2
auto c = replace_exponent(b, _mm_set1_ps(1.0f));
auto d = poly_eval(c, kA, kB, kC); // 2nd degree polynomial
auto e = replace_exponent(d, b); // restore exponent : 2^i * 2^f
The exponent copying can be done as bitwise select using a proper mask (AVX-512 has vpternlogd, and I'm using actually Arm Neon vbsl).
All the input values x must be negative and clamped between -17-f(N) <= x <= -f(N), so that when they are scaled by (1<<23)/log(2), the maximum sum of the N resulting floating-point values does not reach infinity and the reciprocal does not become denormal. For N=3, f(N) = 4. Larger f(N) will trade off input precision.
The poly_eval coefficients are generated for example by polyfit([1 1.5 2],[1 sqrt(2) 2]), with kA=0.343146, kB=-0.029437, kC=0.68292, producing strictly values smaller than 2 and preventing discontinuities. The maximum average error can be diminished by evaluating the polynomial at x=[1+max_err 1.5-eps 2], y=[1 2^(.5-eps) 2-max_err].
For strictly SSE/AVX, exponent replacement for 1.0f can be done as (x & 0x007fffff) | 0x3f800000. A two-instruction sequence for the latter exponent replacement can be found by ensuring that poly_eval(x) evaluates to a range that can be directly ORed with b & 0xff800000.
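For what it's worth, the hypothetical replace_exponent() helper used in the flow above might be sketched in plain SSE as a bitwise select with the 0xff800000 mask (the name and argument order are assumptions carried over from the pseudo-code; the mask takes sign and exponent from the second operand and the mantissa from the first):

```c
#include <emmintrin.h>

/* keep the mantissa bits of 'm'; take sign and exponent (9 MSBs) from 'e' */
__m128 replace_exponent (__m128 m, __m128 e)
{
    const __m128 mask = _mm_castsi128_ps (_mm_set1_epi32 ((int)0xff800000));
    return _mm_or_ps (_mm_and_ps (e, mask), _mm_andnot_ps (mask, m));
}
```

With e = _mm_set1_ps(1.0f) this forces the result into [1, 2), matching the "copy 9 MSB from 0x3f800000" step.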
I have developed for my purposes the following function that quickly and accurately calculates the natural exponent with single precision. The function works over the entire range of float values. The code is written under Visual Studio (x86). AVX is used instead of SSE, but that shouldn't be a problem. The accuracy of this function is almost the same as that of the standard expf function, but it is significantly faster. The approximation used is based on the Chebyshev series expansion of the function f(t)=t/(2^(t/2)-1)+t/2 for t from [-1; 1]. I thank Peter Cordes for his good advice.
__declspec(naked) float __vectorcall fexp(float x)
{
static const float ct[7] = // Constants table
{
1.44269502f, // lb(e)
1.92596299E-8f, // Correction to the value lb(e)
-9.21120925E-4f, // 16*b2
0.115524396f, // 4*b1
2.88539004f, // b0
2.0f, // 2
4.65661287E-10f // 2^-31
};
__asm
{
mov ecx,offset ct // ecx contains the address of constants tables
vmulss xmm1,xmm0,[ecx] // xmm1 = x*lb(e)
vcvtss2si eax,xmm1 // eax = round(x*lb(e)) = k
cdq // edx=-1, if x<0 or overflow, otherwise edx=0
vmovss xmm3,[ecx+8] // Initialize the sum with highest coefficient 16*b2
and edx,4 // edx=4, if x<0 or overflow, otherwise edx=0
vcvtsi2ss xmm1,xmm1,eax // xmm1 = k
lea eax,[eax+8*edx] // Add 32 to exponent, if x<0
vfmsub231ss xmm1,xmm0,[ecx] // xmm1 = x*lb(e)-k = t/2 in the range from -0.5 to 0.5
add eax,126 // The exponent of 2^(k-1) or 2^(k+31) with bias 127
jle exp_low // Jump if x<<0 or overflow (|x| too large or x=NaN)
vfmadd132ss xmm0,xmm1,[ecx+4] // xmm0 = t/2 (corrected value)
cmp eax,254 // Check that the exponent is not too large
jg exp_inf // Jump to set Inf if overflow
vmulss xmm2,xmm0,xmm0 // xmm2 = t^2/4 - the argument of the polynomial
shl eax,23 // The bits of the float value 2^(k-1) or 2^(k+31)
vfmadd213ss xmm3,xmm2,[ecx+12] // xmm3 = 4*b1+4*b2*t^2
vmovd xmm1,eax // xmm1 = 2^(k-1) or 2^(k+31)
vfmsub213ss xmm3,xmm2,xmm0 // xmm3 = -t/2+b1*t^2+b2*t^4
vaddss xmm0,xmm0,xmm0 // xmm0 = t
vaddss xmm3,xmm3,[ecx+16] // xmm3 = b0-t/2+b1*t^2+b2*t^4 = f(t)-t/2
vdivss xmm0,xmm0,xmm3 // xmm0 = t/(f(t)-t/2)
vfmadd213ss xmm0,xmm1,xmm1 // xmm0 = e^x with shifted exponent of -1 or 31
vmulss xmm0,xmm0,[ecx+edx+20] // xmm0 = e^x
ret // Return
exp_low: // Handling the case of x<<0 or overflow
vucomiss xmm0,[ecx] // Check the sign of x and a condition x=NaN
jp exp_end // Complete with NaN result, if x=NaN
exp_inf: // Entry point for processing large x
vxorps xmm0,xmm0,xmm0 // xmm0 = 0
jc exp_end // Ready, if x<<0
vrcpss xmm0,xmm0,xmm0 // xmm0 = Inf in case x>>0
exp_end: // The result at xmm0 is ready
ret // Return
}
}
Below I post a simplified algorithm. Support for denormalized numbers in the result is removed here.
__declspec(naked) float __vectorcall fexp(float x)
{
static const float ct[5] = // Constants table
{
1.44269502f, // lb(e)
1.92596299E-8f, // Correction to the value lb(e)
-9.21120925E-4f, // 16*b2
0.115524396f, // 4*b1
2.88539004f // b0
};
__asm
{
mov edx,offset ct // edx contains the address of constants tables
vmulss xmm1,xmm0,[edx] // xmm1 = x*lb(e)
vcvtss2si eax,xmm1 // eax = round(x*lb(e)) = k
vmovss xmm3,[edx+8] // Initialize the sum with highest coefficient 16*b2
vcvtsi2ss xmm1,xmm1,eax // xmm1 = k
cmp eax,127 // Check that the exponent is not too large
jg exp_break // Jump to set Inf if overflow
vfmsub231ss xmm1,xmm0,[edx] // xmm1 = x*lb(e)-k = t/2 in the range from -0.5 to 0.5
add eax,127 // Receive the exponent of 2^k with the bias 127
jle exp_break // The result is 0, if x<<0
vfmadd132ss xmm0,xmm1,[edx+4] // xmm0 = t/2 (corrected value)
vmulss xmm2,xmm0,xmm0 // xmm2 = t^2/4 - the argument of polynomial
shl eax,23 // eax contains the bits of 2^k
vfmadd213ss xmm3,xmm2,[edx+12] // xmm3 = 4*b1+4*b2*t^2
vmovd xmm1,eax // xmm1 = 2^k
vfmsub213ss xmm3,xmm2,xmm0 // xmm3 = -t/2+b1*t^2+b2*t^4
vaddss xmm0,xmm0,xmm0 // xmm0 = t
vaddss xmm3,xmm3,[edx+16] // xmm3 = b0-t/2+b1*t^2+b2*t^4 = f(t)-t/2
vdivss xmm0,xmm0,xmm3 // xmm0 = t/(f(t)-t/2)
vfmadd213ss xmm0,xmm1,xmm1 // xmm0 = 2^k*(t/(f(t)-t/2)+1) = e^x
ret // Return
exp_break: // Get 0 for x<0 or Inf for x>>0
vucomiss xmm0,[edx] // Check the sign of x and a condition x=NaN
jp exp_end // Complete with NaN result, if x=NaN
vxorps xmm0,xmm0,xmm0 // xmm0 = 0
jc exp_end // Ready, if x<<0
vrcpss xmm0,xmm0,xmm0 // xmm0 = Inf, if x>>0
exp_end: // The result at xmm0 is ready
ret // Return
}
}
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I've been learning about faster exponentiation algorithms (k-ary, sliding window, etc.), and was wondering which ones are used in CPUs/programming languages? (I'm fuzzy on whether this happens in the CPU or through the compiler.)
And just for kicks, which is the fastest?
Edit regarding the broadness: It's intentionally broad because I know there are a bunch of different techniques to do this. The checked answer had what I was looking for.
I assume your interest is in implementation of the exponentiation functions that can be found in standard math libraries for HLLs, in particular C/C++. These include the functions exp(), exp2(), exp10(), and pow(), as well as single-precision counterparts expf(), exp2f(), exp10f(), and powf().
The exponentiation methods you mention (such as k-ary, sliding window) are typically employed in cryptographic algorithms, such as RSA, which is exponentiation based. They are not typically used for the exponentiation functions provided via math.h or cmath. The implementation details for standard math functions like exp() differ, but a common scheme follows a three-step process:
reduction of the function argument to a primary approximation interval
approximation of a suitable base function on the primary approximation interval
mapping back the result from the primary interval to the entire range of the function
An auxiliary step is often the handling of special cases. These can pertain to special mathematical situations such as log(0.0), or special floating-point operands such as NaN (Not a Number).
The C99 code for expf(float) below shows in exemplary fashion what those steps look like for a concrete example. The argument a is first split such that exp(a) = e^r * 2^i, where i is an integer and r is in [log(sqrt(0.5)), log(sqrt(2.0))], the primary approximation interval. In the second step, we now approximate e^r with a polynomial. Such approximations can be designed according to various design criteria, such as minimizing absolute or relative error. The polynomial can be evaluated in various ways, including Horner's scheme and Estrin's scheme.
The code below uses a very common approach by employing a minimax approximation, which minimizes the maximum error over the entire approximation interval. A standard algorithm for computing such approximations is the Remez algorithm. Evaluation is via Horner's scheme; the numerical accuracy of this evaluation is enhanced by the use of fmaf().
The standard math function fmaf() implements what is known as a fused multiply-add, or FMA. This computes a*b+c using the full, unrounded product a*b in the addition and applying a single rounding at the end. On most modern hardware, such as GPUs, IBM Power CPUs, recent x86 processors (e.g. Haswell), and recent ARM processors (as an optional extension), this maps straight to a hardware instruction. On platforms that lack such an instruction, fmaf() will map to fairly slow emulation code, in which case we would not want to use it if we are interested in performance.
The final computation is the multiplication by 2^i, for which C and C++ provide the function ldexp(). In "industrial strength" library code one typically uses a machine-specific idiom here that takes advantage of the use of IEEE-754 binary arithmetic for float. Lastly, the code cleans up cases of overflow and underflow.
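As an illustration, here is one common such idiom, sketched under the assumption of IEEE-754 binary32 floats (the helper name scale_by_pow2 is made up for this example); it builds the scale factor 2^i directly from its bit pattern rather than calling ldexpf():

```c
#include <stdint.h>
#include <string.h>

/* Multiply a by 2**i by constructing 2**i from its IEEE-754 bit pattern.
   Valid only while i + 127 stays in [1, 254], i.e. while 2**i is a
   normal, finite float; library code extends the range by splitting i
   and multiplying by two partial scale factors. */
static float scale_by_pow2 (float a, int i)
{
    uint32_t bits = (uint32_t)(i + 127) << 23; /* exponent field of 2**i */
    float f;
    memcpy (&f, &bits, sizeof f);
    return a * f;
}
```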
The x87 FPU inside x86 processors has an instruction F2XM1 that computes 2^x-1 on [-1,1]. This can be used for the second step of the computation of exp() and exp2(). There is an instruction FSCALE which is used to multiply by 2^i in the third step. A common way of implementing F2XM1 itself is as microcode that utilizes a rational or polynomial approximation. Note that the x87 FPU is maintained mostly for legacy support these days. On modern x86 platforms, libraries typically use pure software implementations based on SSE and algorithms similar to the one shown below. Some combine small tables with polynomial approximations.
pow(x,y) can be conceptually implemented as exp(y*log(x)), but this suffers from significant loss of accuracy when x is near unity and y is large in magnitude, as well as incorrect handling of numerous special cases specified in the C/C++ standards. One way to get around the accuracy issue is to compute log(x) and the product y*log(x) in some form of extended precision. The details would fill an entire, lengthy separate answer, and I do not have code handy to demonstrate it. In various C/C++ math libraries, pow(double,int) and powf(float, int) are computed by a separate code path that applies the "square-and-multiply" method with bit-wise scanning of the binary representation of the integer exponent.
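A minimal sketch of that square-and-multiply code path (powif is a made-up name for illustration, not a standard function); it scans the exponent's bits from least significant to most significant:

```c
/* Compute x**n for integer n by binary exponentiation. Note that, unlike
   a correctly rounded pow(), this accumulates one rounding error per
   multiplication. */
float powif (float x, int n)
{
    /* 0U - (unsigned)n avoids overflow for n == INT_MIN */
    unsigned int un = (n < 0) ? (0U - (unsigned int)n) : (unsigned int)n;
    float r = 1.0f;
    while (un) {
        if (un & 1) r = r * x; /* this exponent bit is set  */
        x = x * x;             /* x now holds x0**(2**k)    */
        un >>= 1;
    }
    return (n < 0) ? (1.0f / r) : r;
}
```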
#include <math.h> /* import fmaf(), ldexpf(), INFINITY */
/* Like rintf(), but -0.0f -> +0.0f, and |a| must be < 2**22 */
float quick_and_dirty_rintf (float a)
{
const float cvt_magic = 0x1.800000p+23f;
return (a + cvt_magic) - cvt_magic;
}
/* Approximate exp(a) on the interval [log(sqrt(0.5)), log(sqrt(2.0))]. */
float expf_poly (float a)
{
float r;
r = 0x1.694000p-10f; // 1.37805939e-3
r = fmaf (r, a, 0x1.125edcp-07f); // 8.37312452e-3
r = fmaf (r, a, 0x1.555b5ap-05f); // 4.16695364e-2
r = fmaf (r, a, 0x1.555450p-03f); // 1.66664720e-1
r = fmaf (r, a, 0x1.fffff6p-02f); // 4.99999851e-1
r = fmaf (r, a, 0x1.000000p+00f); // 1.00000000e+0
r = fmaf (r, a, 0x1.000000p+00f); // 1.00000000e+0
return r;
}
/* Approximate exp2() on interval [-0.5,+0.5] */
float exp2f_poly (float a)
{
float r;
r = 0x1.418000p-13f; // 1.53303146e-4
r = fmaf (r, a, 0x1.5efa94p-10f); // 1.33887795e-3
r = fmaf (r, a, 0x1.3b2c6cp-07f); // 9.61833261e-3
r = fmaf (r, a, 0x1.c6af8ep-05f); // 5.55036329e-2
r = fmaf (r, a, 0x1.ebfbe0p-03f); // 2.40226507e-1
r = fmaf (r, a, 0x1.62e430p-01f); // 6.93147182e-1
r = fmaf (r, a, 0x1.000000p+00f); // 1.00000000e+0
return r;
}
/* Approximate exp10(a) on [log(sqrt(0.5))/log(10), log(sqrt(2.0))/log(10)] */
float exp10f_poly (float a)
{
float r;
r = 0x1.a56000p-3f; // 0.20574951
r = fmaf (r, a, 0x1.155aa8p-1f); // 0.54170728
r = fmaf (r, a, 0x1.2bda96p+0f); // 1.17130411
r = fmaf (r, a, 0x1.046facp+1f); // 2.03465796
r = fmaf (r, a, 0x1.53524ap+1f); // 2.65094876
r = fmaf (r, a, 0x1.26bb1cp+1f); // 2.30258512
r = fmaf (r, a, 0x1.000000p+0f); // 1.00000000
return r;
}
/* Compute exponential base e. Maximum ulp error = 0.86565 */
float my_expf (float a)
{
float t, r;
int i;
t = a * 0x1.715476p+0f; // 1/log(2); 1.442695
t = quick_and_dirty_rintf (t);
i = (int)t;
r = fmaf (t, -0x1.62e400p-01f, a); // log_2_hi; -6.93145752e-1
r = fmaf (t, -0x1.7f7d1cp-20f, r); // log_2_lo; -1.42860677e-6
t = expf_poly (r);
r = ldexpf (t, i);
if (a < -105.0f) r = 0.0f;
if (a > 105.0f) r = INFINITY; // +INF
return r;
}
/* Compute exponential base 2. Maximum ulp error = 0.86770 */
float my_exp2f (float a)
{
float t, r;
int i;
t = quick_and_dirty_rintf (a);
i = (int)t;
r = a - t;
t = exp2f_poly (r);
r = ldexpf (t, i);
if (a < -152.0f) r = 0.0f;
if (a > 152.0f) r = INFINITY; // +INF
return r;
}
/* Compute exponential base 10. Maximum ulp error = 0.95588 */
float my_exp10f (float a)
{
float r, t;
int i;
t = a * 0x1.a934f0p+1f; // log2(10); 3.321928
t = quick_and_dirty_rintf (t);
i = (int)t;
r = fmaf (t, -0x1.344140p-2f, a); // log10(2)_hi // -3.01030159e-1
r = fmaf (t, 0x1.5ec10cp-23f, r); // log10(2)_lo // 1.63332601e-7
t = exp10f_poly (r);
r = ldexpf (t, i);
if (a < -46.0f) r = 0.0f;
if (a > 46.0f) r = INFINITY; // +INF
return r;
}
#include <string.h>
#include <stdint.h>
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float type's limited
exponent range.
*/
refi = double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
#include <stdio.h>
#include <stdlib.h>
int main (void)
{
double ref, ulp, maxulp;
float arg, res, reff;
uint32_t argi, resi, refi, diff, sumdiff;
printf ("testing expf ...\n");
argi = 0;
sumdiff = 0;
maxulp = 0;
do {
arg = uint32_as_float (argi);
res = my_expf (arg);
ref = exp ((double)arg);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) maxulp = ulp;
reff = (float)ref;
refi = float_as_uint32 (reff);
resi = float_as_uint32 (res);
diff = (resi < refi) ? (refi - resi) : (resi - refi);
if (diff > 1) {
printf ("!! expf: arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
} else {
sumdiff += diff;
}
argi++;
} while (argi);
printf ("expf maxulp=%.5f sumdiff=%u\n", maxulp, sumdiff);
printf ("testing exp2f ...\n");
argi = 0;
maxulp = 0;
sumdiff = 0;
do {
arg = uint32_as_float (argi);
res = my_exp2f (arg);
ref = exp2 ((double)arg);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) maxulp = ulp;
reff = (float)ref;
refi = float_as_uint32 (reff);
resi = float_as_uint32 (res);
diff = (resi < refi) ? (refi - resi) : (resi - refi);
if (diff > 1) {
printf ("!! exp2f: arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
} else {
sumdiff += diff;
}
argi++;
} while (argi);
printf ("exp2f maxulp=%.5f sumdiff=%u\n", maxulp, sumdiff);
printf ("testing exp10f ...\n");
argi = 0;
maxulp = 0;
sumdiff = 0;
do {
arg = uint32_as_float (argi);
res = my_exp10f (arg);
ref = exp10 ((double)arg);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) maxulp = ulp;
reff = (float)ref;
refi = float_as_uint32 (reff);
resi = float_as_uint32 (res);
diff = (resi < refi) ? (refi - resi) : (resi - refi);
if (diff > 1) {
printf ("!! exp10f: arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
} else {
sumdiff += diff;
}
argi++;
} while (argi);
printf ("exp10f maxulp=%.5f sumdiff=%u\n", maxulp, sumdiff);
return EXIT_SUCCESS;
}
Is it possible to calculate the inverse error function in C?
I can find erf(x) in <math.h> which calculates the error function, but I can't find anything to do the inverse.
At this time, the ISO C standard math library does not include erfinv(), or its single-precision variant erfinvf(). However, it is not too difficult to create one's own version, which I demonstrate below with an implementation of erfinvf() of reasonable accuracy and performance.
Looking at the graph of the inverse error function we observe that it is highly non-linear and therefore difficult to approximate with a polynomial. One strategy to deal with this scenario is to "linearize" such a function by composing it from simpler elementary functions (which can themselves be computed with high performance and excellent accuracy) and a fairly linear function which is more easily amenable to polynomial approximations or rational approximations of low degree.
Here are some approaches to erfinv linearization known from the literature, all of which are based on logarithms. Typically, authors differentiate between a main, fairly linear portion of the inverse error function from zero to a switchover point very roughly around 0.9 and a tail portion from the switchover point to unity. In the following, log() denotes the natural logarithm, R() denotes a rational approximation, and P() denotes a polynomial approximation.
A. J. Strecok, "On the Calculation of the Inverse of the Error Function."
Mathematics of Computation, Vol. 22, No. 101 (Jan. 1968), pp. 144-158 (online)
β(x) = (-log(1-x²))^½; erfinv(x) = x · R(x²) [main]; R(x) · β(x) [tail]
J. M. Blair, C. A. Edwards, J. H. Johnson, "Rational Chebyshev Approximations for the Inverse of the Error Function." Mathematics of Computation, Vol. 30, No. 136 (Oct. 1976), pp. 827-830 (online)
ξ = (-log(1-x))^(-½); erfinv(x) = x · R(x²) [main]; ξ⁻¹ · R(ξ) [tail]
M. Giles, "Approximating the erfinv function." In GPU Computing Gems Jade Edition, pp. 109-116. 2011. (online)
w = -log(1-x²); s = √w; erfinv(x) = x · P(w) [main]; x · P(s) [tail]
The solution below generally follows the approach by Giles, but simplifies it in not requiring the square root for the tail portion, i.e. it uses two approximations of the type x · P(w). The code takes maximum advantage of the fused multiply-add operation FMA, which is exposed via the standard math functions fma() and fmaf() in C. Many common compute platforms, such as IBM Power, Arm64, x86-64, and GPUs, offer this operation in hardware. Where no hardware support exists, the use of fma{f}() will likely make the code below unacceptably slow, as the operation needs to be emulated by the standard math library. Also, functionally incorrect emulations of FMA are known to exist.
The accuracy of the standard math library's logarithm function logf() will have some impact on the accuracy of my_erfinvf() below. As long as the library provides a faithfully-rounded implementation with error < 1 ulp, the stated error bound should hold; it did for the few libraries I tried. For improved reproducibility, I have included my own portable faithfully-rounded implementation, my_logf().
#include <math.h>
float my_logf (float);
/* compute inverse error functions with maximum error of 2.35793 ulp */
float my_erfinvf (float a)
{
float p, r, t;
t = fmaf (a, 0.0f - a, 1.0f);
t = my_logf (t);
if (fabsf(t) > 6.125f) { // maximum ulp error = 2.35793
p = 3.03697567e-10f; // 0x1.4deb44p-32
p = fmaf (p, t, 2.93243101e-8f); // 0x1.f7c9aep-26
p = fmaf (p, t, 1.22150334e-6f); // 0x1.47e512p-20
p = fmaf (p, t, 2.84108955e-5f); // 0x1.dca7dep-16
p = fmaf (p, t, 3.93552968e-4f); // 0x1.9cab92p-12
p = fmaf (p, t, 3.02698812e-3f); // 0x1.8cc0dep-9
p = fmaf (p, t, 4.83185798e-3f); // 0x1.3ca920p-8
p = fmaf (p, t, -2.64646143e-1f); // -0x1.0eff66p-2
p = fmaf (p, t, 8.40016484e-1f); // 0x1.ae16a4p-1
} else { // maximum ulp error = 2.35002
p = 5.43877832e-9f; // 0x1.75c000p-28
p = fmaf (p, t, 1.43285448e-7f); // 0x1.33b402p-23
p = fmaf (p, t, 1.22774793e-6f); // 0x1.499232p-20
p = fmaf (p, t, 1.12963626e-7f); // 0x1.e52cd2p-24
p = fmaf (p, t, -5.61530760e-5f); // -0x1.d70bd0p-15
p = fmaf (p, t, -1.47697632e-4f); // -0x1.35be90p-13
p = fmaf (p, t, 2.31468678e-3f); // 0x1.2f6400p-9
p = fmaf (p, t, 1.15392581e-2f); // 0x1.7a1e50p-7
p = fmaf (p, t, -2.32015476e-1f); // -0x1.db2aeep-3
p = fmaf (p, t, 8.86226892e-1f); // 0x1.c5bf88p-1
}
r = a * p;
return r;
}
/* compute natural logarithm with a maximum error of 0.85089 ulp */
float my_logf (float a)
{
float i, m, r, s, t;
int e;
m = frexpf (a, &e);
if (m < 0.666666667f) { // 0x1.555556p-1
m = m + m;
e = e - 1;
}
i = (float)e;
/* m in [2/3, 4/3] */
m = m - 1.0f;
s = m * m;
/* Compute log1p(m) for m in [-1/3, 1/3] */
r = -0.130310059f; // -0x1.0ae000p-3
t = 0.140869141f; // 0x1.208000p-3
r = fmaf (r, s, -0.121484190f); // -0x1.f19968p-4
t = fmaf (t, s, 0.139814854f); // 0x1.1e5740p-3
r = fmaf (r, s, -0.166846052f); // -0x1.55b362p-3
t = fmaf (t, s, 0.200120345f); // 0x1.99d8b2p-3
r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
r = fmaf (t, m, r);
r = fmaf (r, m, 0.333331972f); // 0x1.5554fap-2
r = fmaf (r, m, -0.500000000f); // -0x1.000000p-1
r = fmaf (r, s, m);
r = fmaf (i, 0.693147182f, r); // 0x1.62e430p-1 // log(2)
if (!((a > 0.0f) && (a <= 3.40282346e+38f))) { // 0x1.fffffep+127
r = a + a; // silence NaNs if necessary
if (a < 0.0f) r = ( 0.0f / 0.0f); // NaN
if (a == 0.0f) r = (-1.0f / 0.0f); // -Inf
}
return r;
}
Quick & dirty, tolerance under ±6e-3. Work based on "A handy approximation for the error function and its inverse" by Sergei Winitzki.
C/C++ CODE:
#include <math.h>
#define PI 3.14159265358979f // M_PI is not guaranteed by standard <math.h>
float myErfInv2(float x){
float tt1, tt2, lnx, sgn;
sgn = (x < 0) ? -1.0f : 1.0f;
x = (1 - x)*(1 + x); // x = 1 - x*x;
lnx = logf(x);
tt1 = 2/(PI*0.147f) + 0.5f * lnx;
tt2 = 1/(0.147f) * lnx;
return(sgn*sqrtf(-tt1 + sqrtf(tt1*tt1 - tt2)));
}
MATLAB sanity check:
clear all, close all, clc
x = linspace(-1, 1,10000);
% x = 1 - logspace(-8,-15,1000);
a = 0.15449436008930206298828125;
% a = 0.147;
u = log(1-x.^2);
u1 = 2/(pi*a) + u/2; u2 = u/a;
y = sign(x).*sqrt(-u1+sqrt(u1.^2 - u2));
f = erfinv(x); axis equal
figure(1);
plot(x, [y; f]); legend('Approx. erfinv(x)', 'erfinv(x)')
figure(2);
e = f-y;
plot(x, e);
MATLAB Plots:
I don't think it's a standard function in <math.h>, but there are other C math libraries that implement the inverse error function erfinv(x), which you can use.
Also quick and dirty: if less precision is acceptable, then I can share my own approximation based on the inverse hyperbolic tangent. The parameters were found by Monte Carlo simulation, with all random values drawn from the range 0.5 to 1.5:
p1 = 1.4872301551536515
p2 = 0.5739159012216655
p3 = 0.5803635928651558
( atanh( p^( 1 / p3 ) ) / p2 )^( 1 / p1 )
This comes from the algebraic reordering of my erf function approximation with the hyperbolic tangent, where the RMSE error is 0.000367354 for x between 1 and 4:
tanh( x^p1 * p2 )^p3
I wrote another method that uses the fast-converging Newton-Raphson method, an iterative technique for finding a root of a function: it starts with an initial guess and repeatedly improves the guess using the function's derivative. Newton-Raphson requires the function, its derivative, an initial guess, and a stopping criterion.
In this case, for a given input x, the function whose root we seek is f(y) = erf(y) - x, and its derivative is f'(y) = 2.0 / sqrt(pi) * exp(-y**2). The initial guess is the input value x itself. The stopping criterion is a tolerance value, in this case 1.0e-16. Here is the code:
/*
============================================
Compile and execute with:
$ gcc inverf.c -o inverf -lm
$ ./inverf
============================================
*/
#include <stdio.h>
#include <math.h>
int main() {
double x, result, fx, dfx, dx, xold;
double tolerance = 1.0e-16;
double pi = 4.0 * atan(1.0);
int iteration, i;
// input value for x
printf("Calculator for inverse error function.\n");
printf("Enter the value for x: ");
scanf("%lf", &x);
// check that the input value is strictly between -1 and 1;
// at x = +/-1 the inverse error function is infinite and the
// iteration cannot converge
if (x <= -1.0 || x >= 1.0) {
printf("Invalid input, x must be strictly between -1 and 1.\n");
return 0;
}
// initial guess
result = x;
xold = 0.0;
iteration = 0;
// iterate until the solution converges
do {
xold = result;
fx = erf(result) - x;
dfx = 2.0 / sqrt(pi) * exp(-pow(result, 2.0));
dx = fx / dfx;
// update the solution
result = result - dx;
iteration = iteration + 1;
} while (fabs(result - xold) >= tolerance);
// output the result
printf("The inverse error function of %lf is %lf\n", x, result);
printf("Number of iterations: %d\n", iteration);
return 0;
}
In the terminal it should look something like this:
Calculator for inverse error function.
Enter the value for x: 0.5
The inverse error function of 0.500000 is 0.476936
Number of iterations: 5
For the simple and efficient implementation of fast math functions with reasonable accuracy, polynomial minimax approximations are often the method of choice. Minimax approximations are typically generated with a variant of the Remez algorithm. Various widely available tools such as Maple and Mathematica have built-in functionality for this. The generated coefficients are typically computed using high-precision arithmetic. It is well-known that simply rounding those coefficients to machine precision leads to suboptimal accuracy in the resulting implementation.
Instead, one searches for closely related sets of coefficients that are exactly representable as machine numbers to generate a machine-optimized approximation. Two relevant papers are:
Nicolas Brisebarre, Jean-Michel Muller, and Arnaud Tisserand, "Computing Machine-Efficient Polynomial Approximations", ACM Transactions on Mathematical Software, Vol. 32, No. 2, June 2006, pp. 236–256.
Nicolas Brisebarre and Sylvain Chevillard, "Efficient polynomial L∞-approximations", 18th IEEE Symposium on Computer Arithmetic (ARITH-18), Montpellier (France), June 2007, pp. 169-176.
An implementation of the LLL-algorithm from the latter paper is available as the fpminimax() command of the Sollya tool. It is my understanding that all algorithms proposed for the generation of machine-optimized approximations are based on heuristics, and that it is therefore generally unknown what accuracy can be achieved by an optimal approximation. It is not clear to me whether the availability of FMA (fused multiply-add) for the evaluation of the approximation has an influence on the answer to that question. It seems to me naively that it should.
I am currently looking at a simple polynomial approximation for arctangent on [-1,1] that is evaluated in IEEE-754 single-precision arithmetic, using the Horner scheme and FMA. See function atan_poly() in the C99 code below. For lack of access to a Linux machine at the moment, I did not use Sollya to generate these coefficients, but used my own heuristic that could be loosely described as a mixture of steepest descent and simulated annealing (to avoid getting stuck on local minima). The maximum error of my machine-optimized polynomial is very close to 1 ulp, but ideally I would like the maximum ulp error to be below 1 ulp.
I am aware that I could change my computation to increase the accuracy, for example by using a leading coefficient represented to more than single-precision precision, but I would like to keep the code exactly as is (that is, as simple as possible) adjusting only the coefficients to deliver the most accurate result possible.
A "proven" optimal set of coefficients would be ideal, pointers to relevant literature are welcome. I did a literature search but could not find any paper that advances the state of the art meaningfully beyond Sollya's fpminimax(), and none that examine the role of FMA (if any) in this issue.
// max ulp err = 1.03143
float atan_poly (float a)
{
float r, s;
s = a * a;
r = 0x1.7ed1ccp-9f;
r = fmaf (r, s, -0x1.0c2c08p-6f);
r = fmaf (r, s, 0x1.61fdd0p-5f);
r = fmaf (r, s, -0x1.3556b2p-4f);
r = fmaf (r, s, 0x1.b4e128p-4f);
r = fmaf (r, s, -0x1.230ad2p-3f);
r = fmaf (r, s, 0x1.9978ecp-3f);
r = fmaf (r, s, -0x1.5554dcp-2f);
r = r * s;
r = fmaf (r, a, a);
return r;
}
// max ulp err = 1.52637
float my_atanf (float a)
{
float r, t;
t = fabsf (a);
r = t;
if (t > 1.0f) {
r = 1.0f / r;
}
r = atan_poly (r);
if (t > 1.0f) {
r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r); // pi/2 - r
}
r = copysignf (r, a);
return r;
}
The following function is a faithfully-rounded implementation of arctan on [0, 1]:
float atan_poly (float a) {
float s = a * a, u = fmaf(a, -a, 0x1.fde90cp-1f);
float r1 = 0x1.74dfb6p-9f;
float r2 = fmaf (r1, u, 0x1.3a1c7cp-8f);
float r3 = fmaf (r2, s, -0x1.7f24b6p-7f);
float r4 = fmaf (r3, u, -0x1.eb3900p-7f);
float r5 = fmaf (r4, s, 0x1.1ab95ap-5f);
float r6 = fmaf (r5, u, 0x1.80e87cp-5f);
float r7 = fmaf (r6, s, -0x1.e71aa4p-4f);
float r8 = fmaf (r7, u, -0x1.b81b44p-3f);
float r9 = r8 * s;
float r10 = fmaf (r9, a, a);
return r10;
}
The following test harness will abort if the function atan_poly fails to be faithfully-rounded on [1e-16, 1] and print "success" otherwise:
int checkit(float f) {
double d = atan(f);
float d1 = d, d2 = d;
if (d1 < d) d2 = nextafterf(d1, 1.0/0.0);
else d1 = nextafterf(d1, -1.0/0.0);
float p = atan_poly(f);
if (p != d1 && p != d2) return 0;
return 1;
}
int main() {
for (float f = 1; f > 1e-16; f = nextafterf(f, -1.0/0.0)) {
if (!checkit(f)) abort();
}
printf("success\n");
exit(0);
}
The problem with using s in every multiplication is that the polynomial's coefficients do not decay rapidly. Inputs close to 1 result in lots and lots of cancellation of nearly equal numbers, meaning you're trying to find a set of coefficients so that the accumulated roundoff at the end of the computation closely approximates the residual of arctan.
The constant 0x1.fde90cp-1f is a number close to 1 for which (arctan(sqrt(x)) - x) / x^3 is very close to the nearest float. That is, it's a constant that goes into the computation of u so that the cubic coefficient is almost completely determined. (For this program, the cubic coefficient must be either -0x1.b81b44p-3f or -0x1.b81b42p-3f.)
Alternating multiplications by s and u has the effect of reducing the effect of roundoff error in r_i upon r_{i+2} by a factor of at most 1/4, since s*u < 1/4 whatever a is. This gives considerable leeway in choosing the coefficients of fifth order and beyond.
I found the coefficients with the aid of two programs:
One program plugs in a bunch of test points, writes down a system of linear inequalities, and computes bounds on the coefficients from that system of inequalities. Notice that, given a, one can compute the range of r8 that lead to a faithfully-rounded result. To get linear inequalities, I pretended r8 would be computed as a polynomial in the floats s and u in real-number arithmetic; the linear inequalities constrained this real-number r8 to lie in some interval. I used the Parma Polyhedra Library to handle these constraint systems.
Another program randomly tested sets of coefficients in certain ranges, plugging in first a set of test points and then all floats from 1 to 1e-8 in descending order and checking that atan_poly produces a faithful rounding of atan((double)x). If some x failed, it printed out that x and why it failed.
To get coefficients, I hacked this first program to fix c3, work out bounds on r7 for each test point, then get bounds on the higher-order coefficients. Then I hacked it to fix c3 and c5 and get bounds on the higher-order coefficients. I did this until I had all but the three highest-order coefficients, c13, c15, and c17.
I grew the set of test points in the second program until it either stopped printing anything out or printed out "success". I needed surprisingly few test points to reject almost all wrong polynomials---I count 85 test points in the program.
Here I show some of my work selecting the coefficients. In order to get a faithfully-rounded arctan for my initial set of test points assuming r1 through r8 are evaluated in real arithmetic (and rounded somehow unpleasantly but in a way I can't remember) but r9 and r10 are evaluated in float arithmetic, I need:
-0x1.b81b456625f15p-3 <= c3 <= -0x1.b81b416e22329p-3
-0x1.e71d48d9c2ca4p-4 <= c5 <= -0x1.e71783472f5d1p-4
0x1.80e063cb210f9p-5 <= c7 <= 0x1.80ed6efa0a369p-5
0x1.1a3925ea0c5a9p-5 <= c9 <= 0x1.1b3783f148ed8p-5
-0x1.ec6032f293143p-7 <= c11 <= -0x1.e928025d508p-7
-0x1.8c06e851e2255p-7 <= c13 <= -0x1.732b2d4677028p-7
0x1.2aff33d629371p-8 <= c15 <= 0x1.41e9bc01ae472p-8
0x1.1e22f3192fd1dp-9 <= c17 <= 0x1.d851520a087c2p-9
Taking c3 = -0x1.b81b44p-3, assuming r8 is also evaluated in float arithmetic:
-0x1.e71df05b5ad56p-4 <= c5 <= -0x1.e7175823ce2a4p-4
0x1.80df529dd8b18p-5 <= c7 <= 0x1.80f00e8da7f58p-5
0x1.1a283503e1a97p-5 <= c9 <= 0x1.1b5ca5beeeefep-5
-0x1.ed2c7cd87f889p-7 <= c11 <= -0x1.e8c17789776cdp-7
-0x1.90759e6defc62p-7 <= c13 <= -0x1.7045e66924732p-7
0x1.27eb51edf324p-8 <= c15 <= 0x1.47cda0bb1f365p-8
0x1.f6c6b51c50b54p-10 <= c17 <= 0x1.003a00ace9a79p-8
Taking c5 = -0x1.e71aa4p-4, assuming r7 is done in float arithmetic:
0x1.80e3dcc972cb3p-5 <= c7 <= 0x1.80ed1cf56977fp-5
0x1.1aa005ff6a6f4p-5 <= c9 <= 0x1.1afce9904742p-5
-0x1.ec7cf2464a893p-7 <= c11 <= -0x1.e9d6f7039db61p-7
-0x1.8a2304daefa26p-7 <= c13 <= -0x1.7a2456ddec8b2p-7
0x1.2e7b48f595544p-8 <= c15 <= 0x1.44437896b7049p-8
0x1.396f76c06de2ep-9 <= c17 <= 0x1.e3bedf4ed606dp-9
Taking c7 = 0x1.80e87cp-5, assuming r6 is done in float arithmetic:
0x1.1aa86d25bb64fp-5 <= c9 <= 0x1.1aca48cd5caabp-5
-0x1.eb6311f6c29dcp-7 <= c11 <= -0x1.eaedb032dfc0cp-7
-0x1.81438f115cbbp-7 <= c13 <= -0x1.7c9a106629f06p-7
0x1.36d433f81a012p-8 <= c15 <= 0x1.3babb57bb55bap-8
0x1.5cb14e1d4247dp-9 <= c17 <= 0x1.84f1151303aedp-9
Taking c9 = 0x1.1ab95ap-5, assuming r5 is done in float arithmetic:
-0x1.eb51a3b03781dp-7 <= c11 <= -0x1.eb21431536e0dp-7
-0x1.7fcd84700f7cfp-7 <= c13 <= -0x1.7ee38ee4beb65p-7
0x1.390fa00abaaabp-8 <= c15 <= 0x1.3b100a7f5d3cep-8
0x1.6ff147e1fdeb4p-9 <= c17 <= 0x1.7ebfed3ab5f9bp-9
I picked a point close to the middle of the range for c11 and randomly chose c13, c15, and c17.
EDIT: I've now automated this procedure. The following function is also a faithfully-rounded implementation of arctan on [0, 1]:
float c5 = 0x1.997a72p-3;
float c7 = -0x1.23176cp-3;
float c9 = 0x1.b523c8p-4;
float c11 = -0x1.358ff8p-4;
float c13 = 0x1.61c5c2p-5;
float c15 = -0x1.0b16e2p-6;
float c17 = 0x1.7b422p-9;
float juffa_poly (float a) {
float s = a * a;
float r1 = c17;
float r2 = fmaf (r1, s, c15);
float r3 = fmaf (r2, s, c13);
float r4 = fmaf (r3, s, c11);
float r5 = fmaf (r4, s, c9);
float r6 = fmaf (r5, s, c7);
float r7 = fmaf (r6, s, c5);
float r8 = fmaf (r7, s, -0x1.5554dap-2f);
float r9 = r8 * s;
float r10 = fmaf (r9, a, a);
return r10;
}
I find it surprising that this code even exists. For coefficients near these, you can get a bound on the distance between r10 and the value of the polynomial evaluated in real arithmetic on the order of a few ulps thanks to the slow convergence of this polynomial when s is near 1. I had expected roundoff error to behave in a way that was fundamentally "untamable" simply by means of tweaking coefficients.
I pondered the various ideas I received in comments and also ran a few experiments based on that feedback. In the end I decided that a refined heuristic search was the best way forward. I have now managed to reduce the maximum error for atanf_poly() to 1.01036 ulps, with just three arguments exceeding my stated goal of a 1 ulp error bound:
ulp = -1.00829 # |a| = 9.80738342e-001 0x1.f62356p-1 (3f7b11ab)
ulp = -1.01036 # |a| = 9.87551928e-001 0x1.f9a068p-1 (3f7cd034)
ulp = 1.00050 # |a| = 9.99375939e-001 0x1.ffae34p-1 (3f7fd71a)
Based on the manner of generating the improved approximation there is no guarantee that this is a best approximation; no scientific breakthrough here. As the ulp error of the current solution is not yet perfectly balanced, and since continuing the search continues to deliver better approximations (albeit at exponentially increasing time intervals) my guess is that a 1 ulp error bound is achievable, but at the same time we seem to be very close to the best machine-optimized approximation already.
The better quality of the new approximation is the result of a refined search process. I observed that all of the largest ulp errors in the polynomial occur close to unity, say in [0.75,1.0] to be conservative. This allows for a fast scan for interesting coefficient sets whose maximum error is smaller than some bound, say 1.08 ulps. I can then test in detail and exhaustively all coefficient sets within a heuristically chosen hyper-cone anchored at that point. This second step searches for minimum ulp error as the primary goal, and maximum percentage of correctly rounded results as a secondary objective. By using this two-step process across all four cores of my CPU I was able to significantly speed up the search process: I have been able to check about 2^21 coefficient sets so far.
Based on the range of each coefficient across all "close" solutions I now estimate that the total useful search space for this approximation problem is >= 2^24 coefficient sets rather than the more optimistic number of 2^20 I threw out before. This seems like a feasible problem to solve for someone who is either very patient or has lots of computational horse-power at their disposal.
My updated code is as follows:
// max ulp err = 1.01036
float atanf_poly (float a)
{
float r, s;
s = a * a;
r = 0x1.7ed22cp-9f;
r = fmaf (r, s, -0x1.0c2c2ep-6f);
r = fmaf (r, s, 0x1.61fdf6p-5f);
r = fmaf (r, s, -0x1.3556b4p-4f);
r = fmaf (r, s, 0x1.b4e12ep-4f);
r = fmaf (r, s, -0x1.230ae0p-3f);
r = fmaf (r, s, 0x1.9978eep-3f);
r = fmaf (r, s, -0x1.5554dap-2f);
r = r * s;
r = fmaf (r, a, a);
return r;
}
// max ulp err = 1.51871
float my_atanf (float a)
{
float r, t;
t = fabsf (a);
r = t;
if (t > 1.0f) {
r = 1.0f / r;
}
r = atanf_poly (r);
if (t > 1.0f) {
r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r); // pi/2 - r
}
r = copysignf (r, a);
return r;
}
Update (after revisiting the issue two-and-a-half years later)
Using T. Myklebust's draft publication as a starting point, I found the arctangent approximation on [-1,1] with the smallest error so far: it has a maximum error of 0.94528 ulp.
/* Based on: Tor Myklebust, "Computing accurate Horner form approximations
to special functions in finite precision arithmetic", arXiv:1508.03211,
August 2015. maximum ulp err = 0.94528
*/
float atanf_poly (float a)
{
float r, s;
s = a * a;
r = 0x1.6d2086p-9f; // 2.78569828e-3
r = fmaf (r, s, -0x1.03f2ecp-6f); // -1.58660226e-2
r = fmaf (r, s, 0x1.5beebap-5f); // 4.24722321e-2
r = fmaf (r, s, -0x1.33194ep-4f); // -7.49753043e-2
r = fmaf (r, s, 0x1.b403a8p-4f); // 1.06448799e-1
r = fmaf (r, s, -0x1.22f5c2p-3f); // -1.42070308e-1
r = fmaf (r, s, 0x1.997748p-3f); // 1.99934542e-1
r = fmaf (r, s, -0x1.5554d8p-2f); // -3.33331466e-1
r = r * s;
r = fmaf (r, a, a);
return r;
}
This is not an answer to the question, but is too long to fit in a comment:
your question is about the optimal choice of coefficients C3, C5, …, C17 in a polynomial approximation to arctangent where you have pinned C1 to 1 and C2, C4, …, C16 to 0.
The title of your question says you are looking for approximations on [-1, 1], and a good reason to pin the even coefficients to 0 is that it is sufficient and necessary for the approximation to be exactly an odd function. The code in your question “contradicts” the title by applying the polynomial approximation only on [0, 1].
If you use the Remez algorithm to look for coefficients C2, C3, …, C8 to a polynomial approximation of arctangent on [0, 1] instead, you may end up with something like the values below:
#include <stdio.h>
#include <math.h>
float atan_poly (float a)
{
    float r, s;
    s = a;
    // s = a * a;
    r = -3.3507930064626076153585890630056286726807491543578e-2;
    r = fmaf (r, s,  1.3859776280052980081098065189344699108643282883702e-1);
    r = fmaf (r, s, -1.8186361916440430105127602496688553126414578766147e-1);
    r = fmaf (r, s, -1.4583047494913656326643327729704639191810926020847e-2);
    r = fmaf (r, s,  2.1335202878219865228365738728594741358740655881373e-1);
    r = fmaf (r, s, -3.6801711826027841250774413728610805847547996647342e-3);
    r = fmaf (r, s, -3.3289852243978319173749528028057608377028846413080e-1);
    r = fmaf (r, s, -1.8631479933914856903459844359251948006605218562283e-5);
    r = fmaf (r, s,  1.2917291732886065585264586294461539492689296797761e-7);
    r = fmaf (r, a, a);
    return r;
}
int main() {
    for (float x = 0.0f; x < 1.0f; x += 0.1f)
        printf("x: %f\n%a\n%a\n\n", x, atan_poly(x), atan(x));
}
This has roughly the same complexity as the code in your question—the number of multiplications is similar. Looking at this polynomial, there is no reason in particular to want to pin any coefficient to 0. If we wanted to approximate an odd function over [-1, 1] without pinning the even coefficients, they would automatically come up very small and subject to absorption, and then we would want to pin them to 0, but for this approximation over [0, 1], they don't, so we don't have to pin them.
It could have been better or worse than the odd polynomial in your question; it turns out that it is worse (see below). This quick-and-dirty application of LolRemez 0.2 (code included at the bottom of this answer) is, however, good enough to raise the question of the choice of coefficients. I would in particular be curious what happens if you subject the coefficients in this answer to the same “mixture of steepest descent and simulated annealing” optimization step that you applied to get the coefficients in your question.
So, to summarize this remark-posted-as-an-answer, are you sure that you are looking for optimal coefficients C3, C5, …, C17? It seems to me that you are looking for the best sequence of single-precision floating-point operations that produce a faithful approximation to arctangent, and that this approximation does not have to be the Horner form of a degree 17 odd polynomial.
x: 0.000000
0x0p+0
0x0p+0
x: 0.100000
0x1.983e2cp-4
0x1.983e28938f9ecp-4
x: 0.200000
0x1.94442p-3
0x1.94441ff1e8882p-3
x: 0.300000
0x1.2a73a6p-2
0x1.2a73a71dcec16p-2
x: 0.400000
0x1.85a37ap-2
0x1.85a3770ebe7aep-2
x: 0.500000
0x1.dac67p-2
0x1.dac670561bb5p-2
x: 0.600000
0x1.14b1dcp-1
0x1.14b1ddf627649p-1
x: 0.700000
0x1.38b116p-1
0x1.38b113eaa384ep-1
x: 0.800000
0x1.5977a8p-1
0x1.5977a686e0ffbp-1
x: 0.900000
0x1.773388p-1
0x1.77338c44f8faep-1
This is the code that I linked to LolRemez 0.2 in order to optimize the relative accuracy of a degree-9 polynomial approximation of arctangent on [0, 1]:
#include "lol/math/real.h"
#include "lol/math/remez.h"
using lol::real;
using lol::RemezSolver;
real f(real const &y)
{
    return (atan(y) - y) / y;
}
real g(real const &y)
{
    return atan(y) / y;
}
int main(int argc, char **argv)
{
    RemezSolver<8, real> solver;
    solver.Run("1e-1000", 1.0, f, g, 50);
    return 0;
}
This, too, is not an answer but an extended comment.
Recent Intel CPUs and some future AMD CPUs have AVX2. On Linux, look for the avx2 flag in /proc/cpuinfo to see if your CPU supports it.
AVX2 is an extension that allows us to construct and compute using 256-bit vectors -- for example, eight single-precision numbers, or four double-precision numbers -- instead of just scalars. It includes FMA3 support, meaning fused multiply-add for such vectors. Simply put, AVX2 allows us to evaluate eight polynomials in parallel, in pretty much the same time as we evaluate a single one using scalar operations.
The function error8() analyses one set of coefficients, using predefined values of x, comparing against precalculated values of atan(x), and returns the error in ULPs (below and above the desired result separately), as well as the number of results that match the desired floating-point value exactly. These are not needed for simply testing whether a set of coefficients is better than the currently best known set, but allow different strategies on which coefficients to test. (Basically, the maximum error in ULPs forms a surface, and we're trying to find the lowest point on that surface; knowing the "height" of the surface at each point allows us to make educated guesses as to which direction to go -- how to change the coefficients.)
There are four precalculated tables used: known_x for the arguments, known_f for the correctly-rounded single-precision results, known_a for the double-precision "accurate" value (I'm just hoping the library atan() is precise enough for this -- but one should not rely on it without checking!), and known_m to scale the double-precision difference to ULPs. Given a desired range in arguments, the precalculate() function will precalculate these using the library atan() function. (It also relies on IEEE-754 floating-point formats and float and integer byte order being the same, but this is true on the CPUs this code runs on.)
Note that the known_x, known_f, and known_a arrays could be stored in binary files; the known_m contents are trivially derived from known_a. Using the library atan() without verifying it is not a good idea -- but because mine match njuffa's results, I didn't bother to look for a better reference atan().
For simplicity, here is the code in the form of an example program:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#include <errno.h>
/** poly8() - Compute eight polynomials in parallel.
* #x - the arguments
* #c - the coefficients.
*
* The first coefficients are for degree 17, the second
* for degree 15, and so on, down to degree 3.
*
* The compiler should vectorize the expression using vfmaddXXXps
* given an AVX2-capable CPU; for example, Intel Haswell,
* Broadwell, Haswell E, Broadwell E, Skylake, or Cannonlake;
* or AMD Excavator CPUs. Tested on Intel Core i5-4200U.
*
* Using GCC-4.8.2 and
* gcc -O2 -march=core-avx2 -mtune=generic
* this code produces assembly (AT&T syntax)
* vmulps %ymm0, %ymm0, %ymm2
* vmovaps (%rdi), %ymm1
* vmovaps %ymm0, %ymm3
* vfmadd213ps 32(%rdi), %ymm2, %ymm1
* vfmadd213ps 64(%rdi), %ymm2, %ymm1
* vfmadd213ps 96(%rdi), %ymm2, %ymm1
* vfmadd213ps 128(%rdi), %ymm2, %ymm1
* vfmadd213ps 160(%rdi), %ymm2, %ymm1
* vfmadd213ps 192(%rdi), %ymm2, %ymm1
* vfmadd213ps 224(%rdi), %ymm2, %ymm1
* vmulps %ymm2, %ymm1, %ymm0
* vfmadd132ps %ymm3, %ymm3, %ymm0
* ret
* if you omit the 'static inline'.
*/
static inline __v8sf poly8(const __v8sf x, const __v8sf *const c)
{
const __v8sf xx = x * x;
return (((((((c[0]*xx + c[1])*xx + c[2])*xx + c[3])*xx + c[4])*xx + c[5])*xx + c[6])*xx + c[7])*xx*x + x;
}
/** error8() - Calculate maximum error in ULPs
* #x - the arguments
* #co - { C17, C15, C13, C11, C9, C7, C5, C3 }
* #f - the correctly rounded results in single precision
* #a - the expected results in double precision
* #m - 16777216.0 (2^24) divided by two to the power of #a's binary exponent, i.e. 1/ULP of #a
* #n - number of vectors to test
* #max_under - pointer to store the maximum underflow (negative, in ULPs) to
* #max_over - pointer to store the maximum overflow (positive, in ULPs) to
* Returns the number of correctly rounded float results, 0..8*n.
*/
size_t error8(const __v8sf *const x, const float *const co,
const __v8sf *const f, const __v4df *const a, const __v4df *const m,
const size_t n,
float *const max_under, float *const max_over)
{
const __v8sf c[8] = { { co[0], co[0], co[0], co[0], co[0], co[0], co[0], co[0] },
{ co[1], co[1], co[1], co[1], co[1], co[1], co[1], co[1] },
{ co[2], co[2], co[2], co[2], co[2], co[2], co[2], co[2] },
{ co[3], co[3], co[3], co[3], co[3], co[3], co[3], co[3] },
{ co[4], co[4], co[4], co[4], co[4], co[4], co[4], co[4] },
{ co[5], co[5], co[5], co[5], co[5], co[5], co[5], co[5] },
{ co[6], co[6], co[6], co[6], co[6], co[6], co[6], co[6] },
{ co[7], co[7], co[7], co[7], co[7], co[7], co[7], co[7] } };
__v4df min = { 0.0, 0.0, 0.0, 0.0 };
__v4df max = { 0.0, 0.0, 0.0, 0.0 };
__v8si eqs = { 0, 0, 0, 0, 0, 0, 0, 0 };
size_t i;
for (i = 0; i < n; i++) {
const __v8sf v = poly8(x[i], c);
const __v4df d0 = { v[0], v[1], v[2], v[3] };
const __v4df d1 = { v[4], v[5], v[6], v[7] };
const __v4df err0 = (d0 - a[2*i+0]) * m[2*i+0];
const __v4df err1 = (d1 - a[2*i+1]) * m[2*i+1];
eqs -= (__v8si)_mm256_cmp_ps(v, f[i], _CMP_EQ_OQ);
min = _mm256_min_pd(min, err0);
max = _mm256_max_pd(max, err1);
min = _mm256_min_pd(min, err1);
max = _mm256_max_pd(max, err0);
}
if (max_under) {
if (min[0] > min[1]) min[0] = min[1];
if (min[0] > min[2]) min[0] = min[2];
if (min[0] > min[3]) min[0] = min[3];
*max_under = min[0];
}
if (max_over) {
if (max[0] < max[1]) max[0] = max[1];
if (max[0] < max[2]) max[0] = max[2];
if (max[0] < max[3]) max[0] = max[3];
*max_over = max[0];
}
return (size_t)((unsigned int)eqs[0])
+ (size_t)((unsigned int)eqs[1])
+ (size_t)((unsigned int)eqs[2])
+ (size_t)((unsigned int)eqs[3])
+ (size_t)((unsigned int)eqs[4])
+ (size_t)((unsigned int)eqs[5])
+ (size_t)((unsigned int)eqs[6])
+ (size_t)((unsigned int)eqs[7]);
}
/** precalculate() - Allocate and precalculate tables for error8().
* #x0 - First argument to precalculate
* #x1 - Last argument to precalculate
* #xptr - Pointer to a __v8sf pointer for the arguments
* #fptr - Pointer to a __v8sf pointer for the correctly rounded results
* #aptr - Pointer to a __v4df pointer for the comparison results
* #mptr - Pointer to a __v4df pointer for the difference multipliers
* Returns the vector count if successful,
* 0 with errno set otherwise.
*/
size_t precalculate(const float x0, const float x1,
__v8sf **const xptr, __v8sf **const fptr,
__v4df **const aptr, __v4df **const mptr)
{
const size_t align = 64;
unsigned int i0, i1;
size_t n, i, sbytes, dbytes;
__v8sf *x = NULL;
__v8sf *f = NULL;
__v4df *a = NULL;
__v4df *m = NULL;
if (!xptr || !fptr || !aptr || !mptr) {
errno = EINVAL;
return (size_t)0;
}
memcpy(&i0, &x0, sizeof i0);
memcpy(&i1, &x1, sizeof i1);
i0 ^= (i0 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
i1 ^= (i1 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
if (i1 > i0)
n = (((size_t)i1 - (size_t)i0) | (size_t)7) + (size_t)1;
else
if (i0 > i1)
n = (((size_t)i0 - (size_t)i1) | (size_t)7) + (size_t)1;
else {
errno = EINVAL;
return (size_t)0;
}
sbytes = n * sizeof (float);
if (sbytes % align)
sbytes += align - (sbytes % align);
dbytes = n * sizeof (double);
if (dbytes % align)
dbytes += align - (dbytes % align);
if (posix_memalign((void **)&x, align, sbytes)) {
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&f, align, sbytes)) {
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&a, align, dbytes)) {
free(f);
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&m, align, dbytes)) {
free(a);
free(f);
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (x1 > x0) {
float *const xp = (float *)x;
float curr = x0;
for (i = 0; i < n; i++) {
xp[i] = curr;
curr = nextafterf(curr, HUGE_VALF);
}
i = n;
while (i-->0 && xp[i] > x1)
xp[i] = x1;
} else {
float *const xp = (float *)x;
float curr = x0;
for (i = 0; i < n; i++) {
xp[i] = curr;
curr = nextafterf(curr, -HUGE_VALF);
}
i = n;
while (i-->0 && xp[i] < x1)
xp[i] = x1;
}
{
const float *const xp = (const float *)x;
float *const fp = (float *)f;
double *const ap = (double *)a;
double *const mp = (double *)m;
for (i = 0; i < n; i++) {
const float curr = xp[i];
int temp;
fp[i] = atanf(curr);
ap[i] = atan((double)curr);
(void)frexp(ap[i], &temp);
mp[i] = ldexp(16777216.0, -temp); /* 2^(24-temp) = 1/ULP at this magnitude */
}
}
*xptr = x;
*fptr = f;
*aptr = a;
*mptr = m;
errno = 0;
return n/8;
}
static int parse_range(const char *const str, float *const range)
{
float fmin, fmax;
char dummy;
if (sscanf(str, " %f %f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f:%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f,%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f/%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff %ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff:%ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff,%ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff/%ff %c", &fmin, &fmax, &dummy) == 2) {
if (range) {
range[0] = fmin;
range[1] = fmax;
}
return 0;
}
if (sscanf(str, " %f %c", &fmin, &dummy) == 1 ||
sscanf(str, " %ff %c", &fmin, &dummy) == 1) {
if (range) {
range[0] = fmin;
range[1] = fmin;
}
return 0;
}
return errno = ENOENT;
}
static int fix_range(float *const f)
{
if (f && f[0] > f[1]) {
const float tmp = f[0];
f[0] = f[1];
f[1] = tmp;
}
return f && isfinite(f[0]) && isfinite(f[1]) && (f[1] >= f[0]);
}
static const char *f2s(char *const buffer, const size_t size, const float value, const char *const invalid)
{
char format[32];
float parsed;
int decimals, length;
for (decimals = 0; decimals <= 16; decimals++) {
length = snprintf(format, sizeof format, "%%.%df", decimals);
if (length < 1 || length >= (int)sizeof format)
break;
length = snprintf(buffer, size, format, value);
if (length < 1 || length >= (int)size)
break;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
decimals++;
}
for (decimals = 0; decimals <= 16; decimals++) {
length = snprintf(format, sizeof format, "%%.%dg", decimals);
if (length < 1 || length >= (int)sizeof format)
break;
length = snprintf(buffer, size, format, value);
if (length < 1 || length >= (int)size)
break;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
decimals++;
}
length = snprintf(buffer, size, "%a", value);
if (length < 1 || length >= (int)size)
return invalid;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
return invalid;
}
int main(int argc, char *argv[])
{
float xrange[2] = { 0.75f, 1.00f };
float c17range[2], c15range[2], c13range[2], c11range[2];
float c9range[2], c7range[2], c5range[2], c3range[2];
float c[8];
__v8sf *known_x;
__v8sf *known_f;
__v4df *known_a;
__v4df *known_m;
size_t known_n;
if (argc != 10 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s C17 C15 C13 C11 C9 C7 C5 C3 x\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "Each of the coefficients can be a constant or a range,\n");
fprintf(stderr, "for example 0.25 or 0.75:1. x must be a non-empty range.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (parse_range(argv[1], c17range) || !fix_range(c17range)) {
fprintf(stderr, "%s: Invalid C17 range or constant.\n", argv[1]);
return EXIT_FAILURE;
}
if (parse_range(argv[2], c15range) || !fix_range(c15range)) {
fprintf(stderr, "%s: Invalid C15 range or constant.\n", argv[2]);
return EXIT_FAILURE;
}
if (parse_range(argv[3], c13range) || !fix_range(c13range)) {
fprintf(stderr, "%s: Invalid C13 range or constant.\n", argv[3]);
return EXIT_FAILURE;
}
if (parse_range(argv[4], c11range) || !fix_range(c11range)) {
fprintf(stderr, "%s: Invalid C11 range or constant.\n", argv[4]);
return EXIT_FAILURE;
}
if (parse_range(argv[5], c9range) || !fix_range(c9range)) {
fprintf(stderr, "%s: Invalid C9 range or constant.\n", argv[5]);
return EXIT_FAILURE;
}
if (parse_range(argv[6], c7range) || !fix_range(c7range)) {
fprintf(stderr, "%s: Invalid C7 range or constant.\n", argv[6]);
return EXIT_FAILURE;
}
if (parse_range(argv[7], c5range) || !fix_range(c5range)) {
fprintf(stderr, "%s: Invalid C5 range or constant.\n", argv[7]);
return EXIT_FAILURE;
}
if (parse_range(argv[8], c3range) || !fix_range(c3range)) {
fprintf(stderr, "%s: Invalid C3 range or constant.\n", argv[8]);
return EXIT_FAILURE;
}
if (parse_range(argv[9], xrange) || xrange[0] == xrange[1] ||
!isfinite(xrange[0]) || !isfinite(xrange[1])) {
fprintf(stderr, "%s: Invalid x range.\n", argv[9]);
return EXIT_FAILURE;
}
known_n = precalculate(xrange[0], xrange[1], &known_x, &known_f, &known_a, &known_m);
if (!known_n) {
if (errno == ENOMEM)
fprintf(stderr, "Not enough memory for precalculated tables.\n");
else
fprintf(stderr, "Invalid (empty) x range.\n");
return EXIT_FAILURE;
}
fprintf(stderr, "Precalculated %lu arctangents to compare to.\n", 8UL * (unsigned long)known_n);
fprintf(stderr, "\nC17 C15 C13 C11 C9 C7 C5 C3 max-ulps-under max-ulps-above correctly-rounded percentage cycles\n");
fflush(stderr);
{
const double percent = 12.5 / (double)known_n;
size_t rounded;
char c17buffer[64], c15buffer[64], c13buffer[64], c11buffer[64];
char c9buffer[64], c7buffer[64], c5buffer[64], c3buffer[64];
char minbuffer[64], maxbuffer[64];
float minulps, maxulps;
unsigned long tsc_start, tsc_stop;
for (c[0] = c17range[0]; c[0] <= c17range[1]; c[0] = nextafterf(c[0], HUGE_VALF))
for (c[1] = c15range[0]; c[1] <= c15range[1]; c[1] = nextafterf(c[1], HUGE_VALF))
for (c[2] = c13range[0]; c[2] <= c13range[1]; c[2] = nextafterf(c[2], HUGE_VALF))
for (c[3] = c11range[0]; c[3] <= c11range[1]; c[3] = nextafterf(c[3], HUGE_VALF))
for (c[4] = c9range[0]; c[4] <= c9range[1]; c[4] = nextafterf(c[4], HUGE_VALF))
for (c[5] = c7range[0]; c[5] <= c7range[1]; c[5] = nextafterf(c[5], HUGE_VALF))
for (c[6] = c5range[0]; c[6] <= c5range[1]; c[6] = nextafterf(c[6], HUGE_VALF))
for (c[7] = c3range[0]; c[7] <= c3range[1]; c[7] = nextafterf(c[7], HUGE_VALF)) {
tsc_start = __builtin_ia32_rdtsc();
rounded = error8(known_x, c, known_f, known_a, known_m, known_n, &minulps, &maxulps);
tsc_stop = __builtin_ia32_rdtsc();
printf("%-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %lu %.3f %lu\n",
f2s(c17buffer, sizeof c17buffer, c[0], "?"),
f2s(c15buffer, sizeof c15buffer, c[1], "?"),
f2s(c13buffer, sizeof c13buffer, c[2], "?"),
f2s(c11buffer, sizeof c11buffer, c[3], "?"),
f2s(c9buffer, sizeof c9buffer, c[4], "?"),
f2s(c7buffer, sizeof c7buffer, c[5], "?"),
f2s(c5buffer, sizeof c5buffer, c[6], "?"),
f2s(c3buffer, sizeof c3buffer, c[7], "?"),
f2s(minbuffer, sizeof minbuffer, minulps, "?"),
f2s(maxbuffer, sizeof maxbuffer, maxulps, "?"),
rounded, (double)rounded * percent,
(unsigned long)(tsc_stop - tsc_start));
fflush(stdout);
}
}
return EXIT_SUCCESS;
}
The code does compile using GCC-4.8.2 on Linux, but might have to be modified for other compilers and/or OSes. (I'd be happy to include/accept edits fixing those, though; I just don't have Windows or ICC myself to check with.)
To compile this, I recommend
gcc -Wall -O3 -fomit-frame-pointer -march=native -mtune=native example.c -lm -o example
Run without arguments to see usage; or
./example 0x1.7ed24ap-9f -0x1.0c2c12p-6f 0x1.61fdd2p-5f -0x1.3556b0p-4f 0x1.b4e138p-4f -0x1.230ae2p-3f 0x1.9978eep-3f -0x1.5554dap-2f 0.75:1
to check what it reports for njuffa's coefficient set, compared against standard C library atan() function, with all possible x in [0.75, 1] considered.
Instead of a fixed coefficient, you can also use min:max to define a range to scan (scanning all unique single-precision floating-point values). Each possible combination of the coefficients is tested.
Because I prefer decimal notation, but need to keep the values exact, I use the f2s() function to display the floating-point values. It is a simple brute-force helper function, that uses the shortest formatting that yields the same value when parsed back to float.
For example,
./example 0x1.7ed248p-9f:0x1.7ed24cp-9f -0x1.0c2c10p-6f:-0x1.0c2c14p-6f 0x1.61fdd0p-5f:0x1.61fdd4p-5f -0x1.3556aep-4f:-0x1.3556b2p-4f 0x1.b4e136p-4f:0x1.b4e13ap-4f -0x1.230ae0p-3f:-0x1.230ae4p-3f 0x1.9978ecp-3f:0x1.9978f0p-3f -0x1.5554d8p-2f:-0x1.5554dcp-2f 0.75:1
computes all the 6561 (3^8) coefficient combinations ±1 ULP around njuffa's set for x in [0.75, 1]. (Indeed, it shows that decreasing C17 by 1 ULP to 0x1.7ed248p-9f yields the exact same results.)
(That run took 90 seconds on a Core i5-4200U at 2.6 GHz -- pretty much in line with my estimate of 30 coefficient sets per second per GHz per core. While this code is not threaded, the key functions are thread-safe, so threading should not be too difficult. This Core i5-4200U is in a laptop, and gets pretty hot even when stressing just one core, so I didn't bother.)
(I consider the above code to be in public domain, or CC0-licensed where public domain dedication is not possible. In fact, I'm not sure if it is creative enough to be copyrightable at all. Anyway, feel free to use it anywhere in any way you wish, as long as you don't blame me if it breaks.)
Questions? Enhancements? Edits to fix Linux/GCC'isms are welcome!