Best machine-optimized polynomial minimax approximation to arctangent on [-1,1]?

For the simple and efficient implementation of fast math functions with reasonable accuracy, polynomial minimax approximations are often the method of choice. Minimax approximations are typically generated with a variant of the Remez algorithm. Various widely available tools such as Maple and Mathematica have built-in functionality for this. The generated coefficients are typically computed using high-precision arithmetic. It is well-known that simply rounding those coefficients to machine precision leads to suboptimal accuracy in the resulting implementation.
Instead, one searches for closely related sets of coefficients that are exactly representable as machine numbers to generate a machine-optimized approximation. Two relevant papers are:
Nicolas Brisebarre, Jean-Michel Muller, and Arnaud Tisserand, "Computing Machine-Efficient Polynomial Approximations", ACM Transactions on Mathematical Software, Vol. 32, No. 2, June 2006, pp. 236–256.
Nicolas Brisebarre and Sylvain Chevillard, "Efficient polynomial L∞-approximations", 18th IEEE Symposium on Computer Arithmetic (ARITH-18), Montpellier (France), June 2007, pp. 169-176.
An implementation of the LLL-algorithm from the latter paper is available as the fpminimax() command of the Sollya tool. It is my understanding that all algorithms proposed for the generation of machine-optimized approximations are based on heuristics, and that it is therefore generally unknown what accuracy can be achieved by an optimal approximation. It is not clear to me whether the availability of FMA (fused multiply-add) for the evaluation of the approximation has an influence on the answer to that question. It seems to me naively that it should.
I am currently looking at a simple polynomial approximation for arctangent on [-1,1] that is evaluated in IEEE-754 single-precision arithmetic, using the Horner scheme and FMA. See function atan_poly() in the C99 code below. For lack of access to a Linux machine at the moment, I did not use Sollya to generate these coefficients, but used my own heuristic that could be loosely described as a mixture of steepest descent and simulated annealing (to avoid getting stuck on local minima). The maximum error of my machine-optimized polynomial is very close to 1 ulp, but ideally I would like the maximum ulp error to be below 1 ulp.
I am aware that I could change my computation to increase the accuracy, for example by using a leading coefficient represented to more than single-precision precision, but I would like to keep the code exactly as is (that is, as simple as possible) adjusting only the coefficients to deliver the most accurate result possible.
A "proven" optimal set of coefficients would be ideal, pointers to relevant literature are welcome. I did a literature search but could not find any paper that advances the state of the art meaningfully beyond Sollya's fpminimax(), and none that examine the role of FMA (if any) in this issue.
// max ulp err = 1.03143
float atan_poly (float a)
{
    float r, s;
    s = a * a;
    r = 0x1.7ed1ccp-9f;
    r = fmaf (r, s, -0x1.0c2c08p-6f);
    r = fmaf (r, s, 0x1.61fdd0p-5f);
    r = fmaf (r, s, -0x1.3556b2p-4f);
    r = fmaf (r, s, 0x1.b4e128p-4f);
    r = fmaf (r, s, -0x1.230ad2p-3f);
    r = fmaf (r, s, 0x1.9978ecp-3f);
    r = fmaf (r, s, -0x1.5554dcp-2f);
    r = r * s;
    r = fmaf (r, a, a);
    return r;
}
// max ulp err = 1.52637
float my_atanf (float a)
{
    float r, t;
    t = fabsf (a);
    r = t;
    if (t > 1.0f) {
        r = 1.0f / r;
    }
    r = atan_poly (r);
    if (t > 1.0f) {
        r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r); // pi/2 - r
    }
    r = copysignf (r, a);
    return r;
}

The following function is a faithfully-rounded implementation of arctan on [0, 1]:
float atan_poly (float a) {
    float s = a * a, u = fmaf(a, -a, 0x1.fde90cp-1f);
    float r1 = 0x1.74dfb6p-9f;
    float r2 = fmaf (r1, u, 0x1.3a1c7cp-8f);
    float r3 = fmaf (r2, s, -0x1.7f24b6p-7f);
    float r4 = fmaf (r3, u, -0x1.eb3900p-7f);
    float r5 = fmaf (r4, s, 0x1.1ab95ap-5f);
    float r6 = fmaf (r5, u, 0x1.80e87cp-5f);
    float r7 = fmaf (r6, s, -0x1.e71aa4p-4f);
    float r8 = fmaf (r7, u, -0x1.b81b44p-3f);
    float r9 = r8 * s;
    float r10 = fmaf (r9, a, a);
    return r10;
}
The following test harness will abort if the function atan_poly fails to be faithfully-rounded on [1e-16, 1] and print "success" otherwise:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int checkit(float f) {
    double d = atan(f);
    float d1 = d, d2 = d;
    if (d1 < d) d2 = nextafterf(d1, 1.0/0.0);
    else d1 = nextafterf(d1, -1.0/0.0);
    float p = atan_poly(f);
    if (p != d1 && p != d2) return 0;
    return 1;
}

int main() {
    for (float f = 1; f > 1e-16; f = nextafterf(f, -1.0/0.0)) {
        if (!checkit(f)) abort();
    }
    printf("success\n");
    exit(0);
}
The problem with using s in every multiplication is that the polynomial's coefficients do not decay rapidly. Inputs close to 1 result in lots and lots of cancellation of nearly equal numbers, meaning you're trying to find a set of coefficients so that the accumulated roundoff at the end of the computation closely approximates the residual of arctan.
The constant 0x1.fde90cp-1f is a number close to 1 for which (arctan(sqrt(x)) - x) / x^3 is very close to the nearest float. That is, it's a constant that goes into the computation of u so that the cubic coefficient is almost completely determined. (For this program, the cubic coefficient must be either -0x1.b81b44p-3f or -0x1.b81b42p-3f.)
Alternating multiplications by s and u reduces the effect of roundoff error in r_i upon r_{i+2} by a factor of at most 1/4, since s*u = a^2 * (0x1.fde90cp-1f - a^2) <= (0x1.fde90cp-1f)^2 / 4 < 1/4 whatever a is. This gives considerable leeway in choosing the coefficients of fifth order and beyond.
I found the coefficients with the aid of two programs:
One program plugs in a bunch of test points, writes down a system of linear inequalities, and computes bounds on the coefficients from that system of inequalities. Notice that, given a, one can compute the range of r8 that lead to a faithfully-rounded result. To get linear inequalities, I pretended r8 would be computed as a polynomial in the floats s and u in real-number arithmetic; the linear inequalities constrained this real-number r8 to lie in some interval. I used the Parma Polyhedra Library to handle these constraint systems.
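For illustration, here is a small stand-alone sketch of that per-point constraint (not the actual program): it treats r8 as a real number, keeps only the structure r10 = a + r8*s*a, and ignores the rounding of r9 and r10, so it only approximates the true bounds; the real tool additionally models those roundings and hands the resulting inequalities to the Parma Polyhedra Library.
#include <stdio.h>
#include <math.h>

/* interval of real-valued r8 such that a + r8*s*a (with s = a*a) lies in the
   set of reals that round to one of the two floats bracketing atan(a) */
static void r8_interval (float a, double *r8_lo, double *r8_hi)
{
    double d  = atan ((double)a);
    float  d1 = (float)d, d2 = d1;
    if ((double)d1 < d) d2 = nextafterf (d1,  INFINITY);
    else                d1 = nextafterf (d1, -INFINITY);
    /* reals rounding (to nearest) to d1 or d2 form roughly this interval */
    double lo = 0.5 * ((double)nextafterf (d1, -INFINITY) + (double)d1);
    double hi = 0.5 * ((double)d2 + (double)nextafterf (d2,  INFINITY));
    double sa = (double)(a * a) * (double)a;   /* s*a, with s = a*a computed as a float */
    *r8_lo = (lo - (double)a) / sa;
    *r8_hi = (hi - (double)a) / sa;
}

int main (void)
{
    double lo, hi;
    r8_interval (0.99609375f, &lo, &hi);
    printf ("r8 in [%a, %a]\n", lo, hi);
    return 0;
}
Collecting one such interval per test point yields the linear inequalities described above.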
Another program randomly tested sets of coefficients in certain ranges, plugging in first a set of test points and then all floats from 1 to 1e-8 in descending order and checking that atan_poly produces a faithful rounding of atan((double)x). If some x failed, it printed out that x and why it failed.
To get coefficients, I hacked this first program to fix c3, work out bounds on r7 for each test point, then get bounds on the higher-order coefficients. Then I hacked it to fix c3 and c5 and get bounds on the higher-order coefficients. I did this until I had all but the three highest-order coefficients, c13, c15, and c17.
I grew the set of test points in the second program until it either stopped printing anything out or printed out "success". I needed surprisingly few test points to reject almost all wrong polynomials---I count 85 test points in the program.
Here I show some of my work selecting the coefficients. In order to get a faithfully-rounded arctan for my initial set of test points assuming r1 through r8 are evaluated in real arithmetic (and rounded somehow unpleasantly but in a way I can't remember) but r9 and r10 are evaluated in float arithmetic, I need:
-0x1.b81b456625f15p-3 <= c3 <= -0x1.b81b416e22329p-3
-0x1.e71d48d9c2ca4p-4 <= c5 <= -0x1.e71783472f5d1p-4
0x1.80e063cb210f9p-5 <= c7 <= 0x1.80ed6efa0a369p-5
0x1.1a3925ea0c5a9p-5 <= c9 <= 0x1.1b3783f148ed8p-5
-0x1.ec6032f293143p-7 <= c11 <= -0x1.e928025d508p-7
-0x1.8c06e851e2255p-7 <= c13 <= -0x1.732b2d4677028p-7
0x1.2aff33d629371p-8 <= c15 <= 0x1.41e9bc01ae472p-8
0x1.1e22f3192fd1dp-9 <= c17 <= 0x1.d851520a087c2p-9
Taking c3 = -0x1.b81b44p-3, assuming r8 is also evaluated in float arithmetic:
-0x1.e71df05b5ad56p-4 <= c5 <= -0x1.e7175823ce2a4p-4
0x1.80df529dd8b18p-5 <= c7 <= 0x1.80f00e8da7f58p-5
0x1.1a283503e1a97p-5 <= c9 <= 0x1.1b5ca5beeeefep-5
-0x1.ed2c7cd87f889p-7 <= c11 <= -0x1.e8c17789776cdp-7
-0x1.90759e6defc62p-7 <= c13 <= -0x1.7045e66924732p-7
0x1.27eb51edf324p-8 <= c15 <= 0x1.47cda0bb1f365p-8
0x1.f6c6b51c50b54p-10 <= c17 <= 0x1.003a00ace9a79p-8
Taking c5 = -0x1.e71aa4p-4, assuming r7 is done in float arithmetic:
0x1.80e3dcc972cb3p-5 <= c7 <= 0x1.80ed1cf56977fp-5
0x1.1aa005ff6a6f4p-5 <= c9 <= 0x1.1afce9904742p-5
-0x1.ec7cf2464a893p-7 <= c11 <= -0x1.e9d6f7039db61p-7
-0x1.8a2304daefa26p-7 <= c13 <= -0x1.7a2456ddec8b2p-7
0x1.2e7b48f595544p-8 <= c15 <= 0x1.44437896b7049p-8
0x1.396f76c06de2ep-9 <= c17 <= 0x1.e3bedf4ed606dp-9
Taking c7 = 0x1.80e87cp-5, assuming r6 is done in float arithmetic:
0x1.1aa86d25bb64fp-5 <= c9 <= 0x1.1aca48cd5caabp-5
-0x1.eb6311f6c29dcp-7 <= c11 <= -0x1.eaedb032dfc0cp-7
-0x1.81438f115cbbp-7 <= c13 <= -0x1.7c9a106629f06p-7
0x1.36d433f81a012p-8 <= c15 <= 0x1.3babb57bb55bap-8
0x1.5cb14e1d4247dp-9 <= c17 <= 0x1.84f1151303aedp-9
Taking c9 = 0x1.1ab95ap-5, assuming r5 is done in float arithmetic:
-0x1.eb51a3b03781dp-7 <= c11 <= -0x1.eb21431536e0dp-7
-0x1.7fcd84700f7cfp-7 <= c13 <= -0x1.7ee38ee4beb65p-7
0x1.390fa00abaaabp-8 <= c15 <= 0x1.3b100a7f5d3cep-8
0x1.6ff147e1fdeb4p-9 <= c17 <= 0x1.7ebfed3ab5f9bp-9
I picked a point close to the middle of the range for c11 and randomly chose c13, c15, and c17.
EDIT: I've now automated this procedure. The following function is also a faithfully-rounded implementation of arctan on [0, 1]:
float c5 = 0x1.997a72p-3;
float c7 = -0x1.23176cp-3;
float c9 = 0x1.b523c8p-4;
float c11 = -0x1.358ff8p-4;
float c13 = 0x1.61c5c2p-5;
float c15 = -0x1.0b16e2p-6;
float c17 = 0x1.7b422p-9;
float juffa_poly (float a) {
    float s = a * a;
    float r1 = c17;
    float r2 = fmaf (r1, s, c15);
    float r3 = fmaf (r2, s, c13);
    float r4 = fmaf (r3, s, c11);
    float r5 = fmaf (r4, s, c9);
    float r6 = fmaf (r5, s, c7);
    float r7 = fmaf (r6, s, c5);
    float r8 = fmaf (r7, s, -0x1.5554dap-2f);
    float r9 = r8 * s;
    float r10 = fmaf (r9, a, a);
    return r10;
}
I find it surprising that this code even exists. For coefficients near these, you can get a bound on the distance between r10 and the value of the polynomial evaluated in real arithmetic on the order of a few ulps thanks to the slow convergence of this polynomial when s is near 1. I had expected roundoff error to behave in a way that was fundamentally "untamable" simply by means of tweaking coefficients.

I pondered the various ideas I received in comments and also ran a few experiments based on that feedback. In the end I decided that a refined heuristic search was the best way forward. I have now managed to reduce the maximum error for atanf_poly() to 1.01036 ulps, with just three arguments exceeding my stated goal of a 1 ulp error bound:
ulp = -1.00829 # |a| = 9.80738342e-001 0x1.f62356p-1 (3f7b11ab)
ulp = -1.01036 # |a| = 9.87551928e-001 0x1.f9a068p-1 (3f7cd034)
ulp = 1.00050 # |a| = 9.99375939e-001 0x1.ffae34p-1 (3f7fd71a)
Based on the manner of generating the improved approximation there is no guarantee that this is a best approximation; no scientific breakthrough here. As the ulp error of the current solution is not yet perfectly balanced, and since continuing the search continues to deliver better approximations (albeit at exponentially increasing time intervals) my guess is that a 1 ulp error bound is achievable, but at the same time we seem to be very close to the best machine-optimized approximation already.
The better quality of the new approximation is the result of a refined search process. I observed that all of the largest ulp errors in the polynomial occur close to unity, say in [0.75, 1.0] to be conservative. This allows for a fast scan for interesting coefficient sets whose maximum error is smaller than some bound, say 1.08 ulps. I can then test in detail and exhaustively all coefficient sets within a heuristically chosen hyper-cone anchored at that point. This second step searches for minimum ulp error as the primary goal, and maximum percentage of correctly rounded results as a secondary objective. By using this two-step process across all four cores of my CPU I was able to significantly speed up the search process: I have been able to check about 2^21 coefficient sets so far.
Based on the range of each coefficient across all "close" solutions I now estimate that the total useful search space for this approximation problem is >= 2^24 coefficient sets rather than the more optimistic number of 2^20 I threw out before. This seems like a feasible problem to solve for someone who is either very patient or has lots of computational horse-power at their disposal.
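For reference, the per-argument error metric such a search optimizes can be sketched in a few lines (this is merely an illustration, assuming the double-precision library atan() as the reference; it is not my actual search tool):
#include <math.h>

/* error of the single-precision result res against the double-precision
   reference ref, expressed in ulps of the binade containing ref;
   assumes ref is a positive normal */
double ulp_error (float res, double ref)
{
    int e;
    (void) frexp (ref, &e);              /* ref = m * 2^e with 0.5 <= m < 1 */
    double ulp = ldexp (1.0, e - 24);    /* spacing of binary32 values near ref */
    return ((double)res - ref) / ulp;
}

/* usage: double err = ulp_error (atanf_poly (a), atan ((double)a)); */
The search tracks the maximum of |ulp_error| over all test arguments for each candidate coefficient set.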
My updated code is as follows:
// max ulp err = 1.01036
float atanf_poly (float a)
{
float r, s;
s = a * a;
r = 0x1.7ed22cp-9f;
r = fmaf (r, s, -0x1.0c2c2ep-6f);
r = fmaf (r, s, 0x1.61fdf6p-5f);
r = fmaf (r, s, -0x1.3556b4p-4f);
r = fmaf (r, s, 0x1.b4e12ep-4f);
r = fmaf (r, s, -0x1.230ae0p-3f);
r = fmaf (r, s, 0x1.9978eep-3f);
r = fmaf (r, s, -0x1.5554dap-2f);
r = r * s;
r = fmaf (r, a, a);
return r;
}
// max ulp err = 1.51871
float my_atanf (float a)
{
    float r, t;
    t = fabsf (a);
    r = t;
    if (t > 1.0f) {
        r = 1.0f / r;
    }
    r = atanf_poly (r);
    if (t > 1.0f) {
        r = fmaf (0x1.ddcb02p-1f, 0x1.aee9d6p+0f, -r); // pi/2 - r
    }
    r = copysignf (r, a);
    return r;
}
Update (after revisiting the issue two-and-a-half years later)
Using T. Myklebust's draft publication as a starting point, I found the arctangent approximation on [-1,1] with the smallest error so far: it has a maximum error of 0.94528 ulp.
/* Based on: Tor Myklebust, "Computing accurate Horner form approximations
to special functions in finite precision arithmetic", arXiv:1508.03211,
August 2015. maximum ulp err = 0.94528
*/
float atanf_poly (float a)
{
float r, s;
s = a * a;
r = 0x1.6d2086p-9f; // 2.78569828e-3
r = fmaf (r, s, -0x1.03f2ecp-6f); // -1.58660226e-2
r = fmaf (r, s, 0x1.5beebap-5f); // 4.24722321e-2
r = fmaf (r, s, -0x1.33194ep-4f); // -7.49753043e-2
r = fmaf (r, s, 0x1.b403a8p-4f); // 1.06448799e-1
r = fmaf (r, s, -0x1.22f5c2p-3f); // -1.42070308e-1
r = fmaf (r, s, 0x1.997748p-3f); // 1.99934542e-1
r = fmaf (r, s, -0x1.5554d8p-2f); // -3.33331466e-1
r = r * s;
r = fmaf (r, a, a);
return r;
}

This is not an answer to the question, but is too long to fit in a comment:
your question is about the optimal choice of coefficients C3, C5, …, C17 in a polynomial approximation to arctangent where you have pinned C1 to 1 and C2, C4, …, C16 to 0.
The title of your question says you are looking for approximations on [-1, 1], and a good reason to pin the even coefficients to 0 is that it is sufficient and necessary for the approximation to be exactly an odd function. The code in your question “contradicts” the title by applying the polynomial approximation only on [0, 1].
If you use the Remez algorithm to look for coefficients C2, C3, …, C8 to a polynomial approximation of arctangent on [0, 1] instead, you may end up with something like the values below:
#include <stdio.h>
#include <math.h>
float atan_poly (float a)
{
    float r, s;
    s = a;
    // s = a * a;
    r = -3.3507930064626076153585890630056286726807491543578e-2;
    r = fmaf (r, s, 1.3859776280052980081098065189344699108643282883702e-1);
    r = fmaf (r, s, -1.8186361916440430105127602496688553126414578766147e-1);
    r = fmaf (r, s, -1.4583047494913656326643327729704639191810926020847e-2);
    r = fmaf (r, s, 2.1335202878219865228365738728594741358740655881373e-1);
    r = fmaf (r, s, -3.6801711826027841250774413728610805847547996647342e-3);
    r = fmaf (r, s, -3.3289852243978319173749528028057608377028846413080e-1);
    r = fmaf (r, s, -1.8631479933914856903459844359251948006605218562283e-5);
    r = fmaf (r, s, 1.2917291732886065585264586294461539492689296797761e-7);
    r = fmaf (r, a, a);
    return r;
}

int main() {
    for (float x = 0.0f; x < 1.0f; x += 0.1f)
        printf("x: %f\n%a\n%a\n\n", x, atan_poly(x), atan(x));
}
This has roughly the same complexity as the code in your question—the number of multiplications is similar. Looking at this polynomial, there is no reason in particular to want to pin any coefficient to 0. If we wanted to approximate an odd function over [-1, 1] without pinning the even coefficients, they would automatically come up very small and subject to absorption, and then we would want to pin them to 0, but for this approximation over [0, 1], they don't, so we don't have to pin them.
It could have been better or worse than the odd polynomial in your question. It turns out that it is worse (see below). This quick-and-dirty application of LolRemez 0.2 (code included at the bottom of this answer) seems to be, however, good enough to raise the question of the choice of coefficients. I would in particular be curious what happens if you subject the coefficients in this answer to the same “mixture of steepest descent and simulated annealing” optimization step that you applied to get the coefficients in your question.
So, to summarize this remark-posted-as-an-answer, are you sure that you are looking for optimal coefficients C3, C5, …, C17? It seems to me that you are looking for the best sequence of single-precision floating-point operations that produce a faithful approximation to arctangent, and that this approximation does not have to be the Horner form of a degree 17 odd polynomial.
x: 0.000000
0x0p+0
0x0p+0
x: 0.100000
0x1.983e2cp-4
0x1.983e28938f9ecp-4
x: 0.200000
0x1.94442p-3
0x1.94441ff1e8882p-3
x: 0.300000
0x1.2a73a6p-2
0x1.2a73a71dcec16p-2
x: 0.400000
0x1.85a37ap-2
0x1.85a3770ebe7aep-2
x: 0.500000
0x1.dac67p-2
0x1.dac670561bb5p-2
x: 0.600000
0x1.14b1dcp-1
0x1.14b1ddf627649p-1
x: 0.700000
0x1.38b116p-1
0x1.38b113eaa384ep-1
x: 0.800000
0x1.5977a8p-1
0x1.5977a686e0ffbp-1
x: 0.900000
0x1.773388p-1
0x1.77338c44f8faep-1
This is the code that I linked to LolRemez 0.2 in order to optimize the relative accuracy of a degree-9 polynomial approximation of arctangent on [0, 1]:
#include "lol/math/real.h"
#include "lol/math/remez.h"
using lol::real;
using lol::RemezSolver;
real f(real const &y)
{
return (atan(y) - y) / y;
}
real g(real const &y)
{
return re (atan(y) / y);
}
int main(int argc, char **argv)
{
RemezSolver<8, real> solver;
solver.Run("1e-1000", 1.0, f, g, 50);
return 0;
}

This is not an answer, but an extended comment too.
Recent Intel CPUs and some future AMD CPUs have AVX2. In Linux, look for the avx2 flag in /proc/cpuinfo to see if your CPU supports it.
AVX2 is an extension that allows us to construct and compute using 256-bit vectors -- for example, eight single-precision numbers, or four double-precision numbers -- instead of just scalars. It includes FMA3 support, meaning fused multiply-add for such vectors. Simply put, AVX2 allows us to evaluate eight polynomials in parallel, in pretty much the same time as we evaluate a single one using scalar operations.
The function error8() analyses one set of coefficients, using predefined values of x, comparing against precalculated values of atan(x), and returns the error in ULPs (below and above the desired result separately), as well as the number of results that match the desired floating-point value exactly. These are not needed for simply testing whether a set of coefficients is better than the currently best known set, but allow different strategies on which coefficients to test. (Basically, the maximum error in ULPs forms a surface, and we're trying to find the lowest point on that surface; knowing the "height" of the surface at each point allows us to make educated guesses as to which direction to go -- how to change the coefficients.)
There are four precalculated tables used: known_x for the arguments, known_f for the correctly-rounded single-precision results, known_a for the double-precision "accurate" value (I'm just hoping the library atan() is precise enough for this -- but one should not rely on it without checking!), and known_m to scale the double-precision difference to ULPs. Given a desired range in arguments, the precalculate() function will precalculate these using the library atan() function. (It also relies on IEEE-754 floating-point formats and float and integer byte order being the same, but this is true on the CPUs this code runs on.)
Note that the known_x, known_f, and known_a arrays could be stored in binary files; the known_m contents are trivially derived from known_a. Using the library atan() without verifying it is not a good idea -- but because mine match njuffa's results, I didn't bother to look for a better reference atan().
For simplicity, here is the code in the form of an example program:
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#include <errno.h>
/** poly8() - Compute eight polynomials in parallel.
* #x - the arguments
* #c - the coefficients.
*
* The first coefficients are for degree 17, the second
* for degree 15, and so on, down to degree 3.
*
* The compiler should vectorize the expression using vfmaddXXXps
* given an AVX2-capable CPU; for example, Intel Haswell,
* Broadwell, Haswell E, Broadwell E, Skylake, or Cannonlake;
* or AMD Excavator CPUs. Tested on Intel Core i5-4200U.
*
* Using GCC-4.8.2 and
* gcc -O2 -march=core-avx2 -mtune=generic
* this code produces assembly (AT&T syntax)
* vmulps %ymm0, %ymm0, %ymm2
* vmovaps (%rdi), %ymm1
* vmovaps %ymm0, %ymm3
* vfmadd213ps 32(%rdi), %ymm2, %ymm1
* vfmadd213ps 64(%rdi), %ymm2, %ymm1
* vfmadd213ps 96(%rdi), %ymm2, %ymm1
* vfmadd213ps 128(%rdi), %ymm2, %ymm1
* vfmadd213ps 160(%rdi), %ymm2, %ymm1
* vfmadd213ps 192(%rdi), %ymm2, %ymm1
* vfmadd213ps 224(%rdi), %ymm2, %ymm1
* vmulps %ymm2, %ymm1, %ymm0
* vfmadd132ps %ymm3, %ymm3, %ymm0
* ret
* if you omit the 'static inline'.
*/
static inline __v8sf poly8(const __v8sf x, const __v8sf *const c)
{
const __v8sf xx = x * x;
return (((((((c[0]*xx + c[1])*xx + c[2])*xx + c[3])*xx + c[4])*xx + c[5])*xx + c[6])*xx + c[7])*xx*x + x;
}
/** error8() - Calculate maximum error in ULPs
* #x - the arguments
* #co - { C17, C15, C13, C11, C9, C7, C5, C3 }
* #f - the correctly rounded results in single precision
* #a - the expected results in double precision
* #m - 16777216.0 raised to the same power of two as #a normalized
* #n - number of vectors to test
* #max_under - pointer to store the maximum underflow (negative, in ULPs) to
* #max_over - pointer to store the maximum overflow (positive, in ULPs) to
* Returns the number of correctly rounded float results, 0..8*n.
*/
size_t error8(const __v8sf *const x, const float *const co,
const __v8sf *const f, const __v4df *const a, const __v4df *const m,
const size_t n,
float *const max_under, float *const max_over)
{
const __v8sf c[8] = { { co[0], co[0], co[0], co[0], co[0], co[0], co[0], co[0] },
{ co[1], co[1], co[1], co[1], co[1], co[1], co[1], co[1] },
{ co[2], co[2], co[2], co[2], co[2], co[2], co[2], co[2] },
{ co[3], co[3], co[3], co[3], co[3], co[3], co[3], co[3] },
{ co[4], co[4], co[4], co[4], co[4], co[4], co[4], co[4] },
{ co[5], co[5], co[5], co[5], co[5], co[5], co[5], co[5] },
{ co[6], co[6], co[6], co[6], co[6], co[6], co[6], co[6] },
{ co[7], co[7], co[7], co[7], co[7], co[7], co[7], co[7] } };
__v4df min = { 0.0, 0.0, 0.0, 0.0 };
__v4df max = { 0.0, 0.0, 0.0, 0.0 };
__v8si eqs = { 0, 0, 0, 0, 0, 0, 0, 0 };
size_t i;
for (i = 0; i < n; i++) {
const __v8sf v = poly8(x[i], c);
const __v4df d0 = { v[0], v[1], v[2], v[3] };
const __v4df d1 = { v[4], v[5], v[6], v[7] };
const __v4df err0 = (d0 - a[2*i+0]) * m[2*i+0];
const __v4df err1 = (d1 - a[2*i+1]) * m[2*i+1];
eqs -= (__v8si)_mm256_cmp_ps(v, f[i], _CMP_EQ_OQ);
min = _mm256_min_pd(min, err0);
max = _mm256_max_pd(max, err1);
min = _mm256_min_pd(min, err1);
max = _mm256_max_pd(max, err0);
}
if (max_under) {
if (min[0] > min[1]) min[0] = min[1];
if (min[0] > min[2]) min[0] = min[2];
if (min[0] > min[3]) min[0] = min[3];
*max_under = min[0];
}
if (max_over) {
if (max[0] < max[1]) max[0] = max[1];
if (max[0] < max[2]) max[0] = max[2];
if (max[0] < max[3]) max[0] = max[3];
*max_over = max[0];
}
return (size_t)((unsigned int)eqs[0])
+ (size_t)((unsigned int)eqs[1])
+ (size_t)((unsigned int)eqs[2])
+ (size_t)((unsigned int)eqs[3])
+ (size_t)((unsigned int)eqs[4])
+ (size_t)((unsigned int)eqs[5])
+ (size_t)((unsigned int)eqs[6])
+ (size_t)((unsigned int)eqs[7]);
}
/** precalculate() - Allocate and precalculate tables for error8().
* #x0 - First argument to precalculate
* #x1 - Last argument to precalculate
* #xptr - Pointer to a __v8sf pointer for the arguments
* #fptr - Pointer to a __v8sf pointer for the correctly rounded results
* #aptr - Pointer to a __v4df pointer for the comparison results
* #mptr - Pointer to a __v4df pointer for the difference multipliers
* Returns the vector count if successful,
* 0 with errno set otherwise.
*/
size_t precalculate(const float x0, const float x1,
__v8sf **const xptr, __v8sf **const fptr,
__v4df **const aptr, __v4df **const mptr)
{
const size_t align = 64;
unsigned int i0, i1;
size_t n, i, sbytes, dbytes;
__v8sf *x = NULL;
__v8sf *f = NULL;
__v4df *a = NULL;
__v4df *m = NULL;
if (!xptr || !fptr || !aptr || !mptr) {
errno = EINVAL;
return (size_t)0;
}
memcpy(&i0, &x0, sizeof i0);
memcpy(&i1, &x1, sizeof i1);
i0 ^= (i0 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
i1 ^= (i1 & 0x80000000U) ? 0xFFFFFFFFU : 0x80000000U;
if (i1 > i0)
n = (((size_t)i1 - (size_t)i0) | (size_t)7) + (size_t)1;
else
if (i0 > i1)
n = (((size_t)i0 - (size_t)i1) | (size_t)7) + (size_t)1;
else {
errno = EINVAL;
return (size_t)0;
}
sbytes = n * sizeof (float);
if (sbytes % align)
sbytes += align - (sbytes % align);
dbytes = n * sizeof (double);
if (dbytes % align)
dbytes += align - (dbytes % align);
if (posix_memalign((void **)&x, align, sbytes)) {
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&f, align, sbytes)) {
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&a, align, dbytes)) {
free(f);
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (posix_memalign((void **)&m, align, dbytes)) {
free(a);
free(f);
free(x);
errno = ENOMEM;
return (size_t)0;
}
if (x1 > x0) {
float *const xp = (float *)x;
float curr = x0;
for (i = 0; i < n; i++) {
xp[i] = curr;
curr = nextafterf(curr, HUGE_VALF);
}
i = n;
while (i-->0 && xp[i] > x1)
xp[i] = x1;
} else {
float *const xp = (float *)x;
float curr = x0;
for (i = 0; i < n; i++) {
xp[i] = curr;
curr = nextafterf(curr, -HUGE_VALF);
}
i = n;
while (i-->0 && xp[i] < x1)
xp[i] = x1;
}
{
const float *const xp = (const float *)x;
float *const fp = (float *)f;
double *const ap = (double *)a;
double *const mp = (double *)m;
for (i = 0; i < n; i++) {
const float curr = xp[i];
int temp;
fp[i] = atanf(curr);
ap[i] = atan((double)curr);
(void)frexp(ap[i], &temp);
mp[i] = ldexp(16777216.0, temp);
}
}
*xptr = x;
*fptr = f;
*aptr = a;
*mptr = m;
errno = 0;
return n/8;
}
static int parse_range(const char *const str, float *const range)
{
float fmin, fmax;
char dummy;
if (sscanf(str, " %f %f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f:%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f,%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %f/%f %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff %ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff:%ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff,%ff %c", &fmin, &fmax, &dummy) == 2 ||
sscanf(str, " %ff/%ff %c", &fmin, &fmax, &dummy) == 2) {
if (range) {
range[0] = fmin;
range[1] = fmax;
}
return 0;
}
if (sscanf(str, " %f %c", &fmin, &dummy) == 1 ||
sscanf(str, " %ff %c", &fmin, &dummy) == 1) {
if (range) {
range[0] = fmin;
range[1] = fmin;
}
return 0;
}
return errno = ENOENT;
}
static int fix_range(float *const f)
{
if (f && f[0] > f[1]) {
const float tmp = f[0];
f[0] = f[1];
f[1] = tmp;
}
return f && isfinite(f[0]) && isfinite(f[1]) && (f[1] >= f[0]);
}
static const char *f2s(char *const buffer, const size_t size, const float value, const char *const invalid)
{
char format[32];
float parsed;
int decimals, length;
for (decimals = 0; decimals <= 16; decimals++) {
length = snprintf(format, sizeof format, "%%.%df", decimals);
if (length < 1 || length >= (int)sizeof format)
break;
length = snprintf(buffer, size, format, value);
if (length < 1 || length >= (int)size)
break;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
decimals++;
}
for (decimals = 0; decimals <= 16; decimals++) {
length = snprintf(format, sizeof format, "%%.%dg", decimals);
if (length < 1 || length >= (int)sizeof format)
break;
length = snprintf(buffer, size, format, value);
if (length < 1 || length >= (int)size)
break;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
decimals++;
}
length = snprintf(buffer, size, "%a", value);
if (length < 1 || length >= (int)size)
return invalid;
if (sscanf(buffer, "%f", &parsed) == 1 && parsed == value)
return buffer;
return invalid;
}
int main(int argc, char *argv[])
{
float xrange[2] = { 0.75f, 1.00f };
float c17range[2], c15range[2], c13range[2], c11range[2];
float c9range[2], c7range[2], c5range[2], c3range[2];
float c[8];
__v8sf *known_x;
__v8sf *known_f;
__v4df *known_a;
__v4df *known_m;
size_t known_n;
if (argc != 10 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s C17 C15 C13 C11 C9 C7 C5 C3 x\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "Each of the coefficients can be a constant or a range,\n");
fprintf(stderr, "for example 0.25 or 0.75:1. x must be a non-empty range.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (parse_range(argv[1], c17range) || !fix_range(c17range)) {
fprintf(stderr, "%s: Invalid C17 range or constant.\n", argv[1]);
return EXIT_FAILURE;
}
if (parse_range(argv[2], c15range) || !fix_range(c15range)) {
fprintf(stderr, "%s: Invalid C15 range or constant.\n", argv[2]);
return EXIT_FAILURE;
}
if (parse_range(argv[3], c13range) || !fix_range(c13range)) {
fprintf(stderr, "%s: Invalid C13 range or constant.\n", argv[3]);
return EXIT_FAILURE;
}
if (parse_range(argv[4], c11range) || !fix_range(c11range)) {
fprintf(stderr, "%s: Invalid C11 range or constant.\n", argv[4]);
return EXIT_FAILURE;
}
if (parse_range(argv[5], c9range) || !fix_range(c9range)) {
fprintf(stderr, "%s: Invalid C9 range or constant.\n", argv[5]);
return EXIT_FAILURE;
}
if (parse_range(argv[6], c7range) || !fix_range(c7range)) {
fprintf(stderr, "%s: Invalid C7 range or constant.\n", argv[6]);
return EXIT_FAILURE;
}
if (parse_range(argv[7], c5range) || !fix_range(c5range)) {
fprintf(stderr, "%s: Invalid C5 range or constant.\n", argv[7]);
return EXIT_FAILURE;
}
if (parse_range(argv[8], c3range) || !fix_range(c3range)) {
fprintf(stderr, "%s: Invalid C3 range or constant.\n", argv[8]);
return EXIT_FAILURE;
}
if (parse_range(argv[9], xrange) || xrange[0] == xrange[1] ||
!isfinite(xrange[0]) || !isfinite(xrange[1])) {
fprintf(stderr, "%s: Invalid x range.\n", argv[9]);
return EXIT_FAILURE;
}
known_n = precalculate(xrange[0], xrange[1], &known_x, &known_f, &known_a, &known_m);
if (!known_n) {
if (errno == ENOMEM)
fprintf(stderr, "Not enough memory for precalculated tables.\n");
else
fprintf(stderr, "Invalid (empty) x range.\n");
return EXIT_FAILURE;
}
fprintf(stderr, "Precalculated %lu arctangents to compare to.\n", 8UL * (unsigned long)known_n);
fprintf(stderr, "\nC17 C15 C13 C11 C9 C7 C5 C3 max-ulps-under max-ulps-above correctly-rounded percentage cycles\n");
fflush(stderr);
{
const double percent = 12.5 / (double)known_n;
size_t rounded;
char c17buffer[64], c15buffer[64], c13buffer[64], c11buffer[64];
char c9buffer[64], c7buffer[64], c5buffer[64], c3buffer[64];
char minbuffer[64], maxbuffer[64];
float minulps, maxulps;
unsigned long tsc_start, tsc_stop;
for (c[0] = c17range[0]; c[0] <= c17range[1]; c[0] = nextafterf(c[0], HUGE_VALF))
for (c[1] = c15range[0]; c[1] <= c15range[1]; c[1] = nextafterf(c[1], HUGE_VALF))
for (c[2] = c13range[0]; c[2] <= c13range[1]; c[2] = nextafterf(c[2], HUGE_VALF))
for (c[3] = c11range[0]; c[3] <= c11range[1]; c[3] = nextafterf(c[3], HUGE_VALF))
for (c[4] = c9range[0]; c[4] <= c9range[1]; c[4] = nextafterf(c[4], HUGE_VALF))
for (c[5] = c7range[0]; c[5] <= c7range[1]; c[5] = nextafterf(c[5], HUGE_VALF))
for (c[6] = c5range[0]; c[6] <= c5range[1]; c[6] = nextafterf(c[6], HUGE_VALF))
for (c[7] = c3range[0]; c[7] <= c3range[1]; c[7] = nextafterf(c[7], HUGE_VALF)) {
tsc_start = __builtin_ia32_rdtsc();
rounded = error8(known_x, c, known_f, known_a, known_m, known_n, &minulps, &maxulps);
tsc_stop = __builtin_ia32_rdtsc();
printf("%-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %-13s %lu %.3f %lu\n",
f2s(c17buffer, sizeof c17buffer, c[0], "?"),
f2s(c15buffer, sizeof c15buffer, c[1], "?"),
f2s(c13buffer, sizeof c13buffer, c[2], "?"),
f2s(c11buffer, sizeof c11buffer, c[3], "?"),
f2s(c9buffer, sizeof c9buffer, c[4], "?"),
f2s(c7buffer, sizeof c7buffer, c[5], "?"),
f2s(c5buffer, sizeof c5buffer, c[6], "?"),
f2s(c3buffer, sizeof c3buffer, c[7], "?"),
f2s(minbuffer, sizeof minbuffer, minulps, "?"),
f2s(maxbuffer, sizeof maxbuffer, maxulps, "?"),
rounded, (double)rounded * percent,
(unsigned long)(tsc_stop - tsc_start));
fflush(stdout);
}
}
return EXIT_SUCCESS;
}
The code does compile using GCC-4.8.2 on Linux, but might have to be modified for other compilers and/or OSes. (I'd be happy to include/accept edits fixing those, though. I just don't have Windows or ICC myself so I could check.)
To compile this, I recommend
gcc -Wall -O3 -fomit-frame-pointer -march=native -mtune=native example.c -lm -o example
Run without arguments to see usage; or
./example 0x1.7ed24ap-9f -0x1.0c2c12p-6f 0x1.61fdd2p-5f -0x1.3556b0p-4f 0x1.b4e138p-4f -0x1.230ae2p-3f 0x1.9978eep-3f -0x1.5554dap-2f 0.75:1
to check what it reports for njuffa's coefficient set, compared against standard C library atan() function, with all possible x in [0.75, 1] considered.
Instead of a fixed coefficient, you can also use min:max to define a range to scan (scanning all unique single-precision floating-point values). Each possible combination of the coefficients is tested.
Because I prefer decimal notation, but need to keep the values exact, I use the f2s() function to display the floating-point values. It is a simple brute-force helper function, that uses the shortest formatting that yields the same value when parsed back to float.
For example,
./example 0x1.7ed248p-9f:0x1.7ed24cp-9f -0x1.0c2c10p-6f:-0x1.0c2c14p-6f 0x1.61fdd0p-5f:0x1.61fdd4p-5f -0x1.3556aep-4f:-0x1.3556b2p-4f 0x1.b4e136p-4f:0x1.b4e13ap-4f -0x1.230ae0p-3f:-0x1.230ae4p-3f 0x1.9978ecp-3f:0x1.9978f0p-3f -0x1.5554d8p-2f:-0x1.5554dcp-2f 0.75:1
computes all the 6561 (3^8) coefficient combinations ±1 ULP around njuffa's set for x in [0.75, 1]. (Indeed, it shows that decreasing C17 by 1 ULP to 0x1.7ed248p-9f yields the exact same results.)
(That run took 90 seconds on a Core i5-4200U at 2.6 GHz -- pretty much in line with my estimate of 30 coefficient sets per second per GHz per core. While this code is not threaded, the key functions are thread-safe, so threading should not be too difficult. This Core i5-4200U is in a laptop and gets pretty hot even when stressing just one core, so I didn't bother.)
(I consider the above code to be in public domain, or CC0-licensed where public domain dedication is not possible. In fact, I'm not sure if it is creative enough to be copyrightable at all. Anyway, feel free to use it anywhere in any way you wish, as long as you don't blame me if it breaks.)
Questions? Enhancements? Edits to fix Linux/GCC'isms are welcome!

Related

Accurate computation of principal branch of the Lambert W function with standard C math library

The Lambert W function is the inverse function of f(w) = w·exp(w). It is a multi-valued function that has infinitely many branches over the complex numbers, but only two branches over the real numbers, denoted W0 and W-1. W0 is considered the principal branch, with input domain [-1/e, ∞), while W-1 has input domain [-1/e, 0). Corresponding implementations are often called lambert_w0() and lambert_wm1().
A close relative of the function was first identified by Leonhard Euler [1], when following up on work by Johann Heinrich Lambert [2]. Euler examined the solution of the transcendental equation x^α - x^β = (α - β) v x^(α+β) and in the process considered the simplified case ln x = v x^α. In the course of this he introduced a helper function with the following series expansion around zero:
y = 1 + (2^1/(1·2))u + (3^2/(1·2·3))u^2 + (4^3/(1·2·3·4))u^3 + (5^4/(1·2·3·4·5))u^4 + …
In modern terms this function (which Euler did not name) represents -W(-x)/x, and the solution of ln x = v x^α is x = (-W(-αv)/(αv))^(1/α).
While the Lambert W function made an occasional appearance in the literature, e.g. [3], it was not named and recognized as an important building block until the seminal work of Robert Corless in the 1990s, e.g. [4]. Subsequently the applicability of the Lambert W function to both mathematics and the physical sciences has been expanded through ongoing research, with some examples given in [5].
The Lambert W function is not currently part of the standard math library of ISO C, nor do there seem to be any immediate plans to add it. How can the principal branch of the Lambert W function, W0, be implemented accurately using the ISO-C standard math library?
A faithfully-rounded implementation is probably overly ambitious, but maintaining a 4 ulp error bound (as chosen by the LA profile of the Intel math library) seems achievable and desirable. Support for IEEE-754 (2008) binary floating-point arithmetic and support for fused multiply-add (FMA) operations accessible via the fma() and fmaf() standard library functions can be assumed.
[1] Leonard Euler, “De serie Lambertina plurimisque eius insignibus proprietatibus,” (On Lambert’s series and its many distinctive properties) Acta Academiae Scientiarum Imperialis Petropolitanae pro Anno MDCCLXXIX, Tomus III, Pars II, (Proceedings of the Imperial Academy of Sciences of St. Petersburg for the Year 1779, volume 3, part 2, Jul. - Dec.), St. Petersburg: Academy of Sciences 1783, pp. 29-51 (scan online at Bavarian State Library, Munich)
[2] Johann Heinrich Lambert, "Observationes variae in mathesin puram" (Various observations on pure mathematics) Acta Helveticae physico-mathematico-anatomico-botanico-medica, Vol. 3, Basel: J. R. Imhof 1758, pp. 128-168 (scan online at Biodiversity Heritage Library)
[3] F.N. Fritsch, R.E. Shafer, and W.P. Crowley, "Algorithm 443: Solution of the Transcendental Equation wex=x", Communications of the ACM, Vol. 16, No. 2, February 1973, pp. 123-124.
[4] R.M. Corless, et al., "On the Lambert W function," Advances in computational mathematics, Vol. 5, No. 1, December 1996, pp. 329-359
[5] Iordanis Kesisoglou, Garima Singh, and Michael Nikolaou, "The Lambert Function Should Be in the Engineering Mathematical Toolbox", Computers & Chemical Engineering, Vol. 148, May 2021
From the literature it is clear that the most common method of computing the Lambert W function over the real numbers is via functional iteration. The second-order Newton method is the simplest of these schemes:
w_{i+1} = w_i - (w_i exp(w_i) - x) / (exp(w_i) + w_i exp(w_i))
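As a bare sketch (without the argument scaling and the custom exp() used in the actual implementations below), one such Newton step for solving w·exp(w) = x looks like this:
#include <math.h>

double lambert_w_newton_step (double x, double w)
{
    double ew  = exp (w);
    double num = fma (w, ew, -x);    /* w*exp(w) - x    */
    double den = fma (w, ew, ew);    /* (1 + w)*exp(w)  */
    return w - num / den;
}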
Much of the literature prefers higher-order methods, such as those by Halley, Fritsch, and Schroeder. During exploratory work I found that when performed in finite-precision floating-point arithmetic, the numerical properties of these higher-order iterations are not as favorable as the order may suggest. As a particular example, three Newton iterations consistently outperformed two Halley iterations in terms of accuracy. For this reason I settled on Newton iteration as the main building block, and used custom implementations of exp() that only need to deliver results that are representable as positive normals in IEEE-754 terminology to gain back some performance.
I further found that the Newton iteration will converge only slowly for large operands, in particular when the starting approximation is not very accurate. Since high iteration count is not conducive to good performance, I looked around for an alternative and found a superior candidate in the logarithm-based iteration scheme by Iacono and Boyd[*], which also has second order convergence:
w_{i+1} = (w_i / (1 + w_i)) * (1 + log (x / w_i))
Many implementations of the Lambert W function appear to be using different starting approximations for different portions of the input domain. My preference was for a single starting approximation across the entire input domain with a view to future vectorized implementations.
Luckily Iacono and Boyd also provide a universal starting approximation that works across the entire input domain of W0, which, while not entirely living up to its promises, performs very well. I fine-tuned this for the single-precision implementation which deals with a much narrower input domain, using an optimizing search heuristic to achieve the best possible accuracy. I also employed custom implementations of log() that only have to deal with inputs that are positive normals.
Some care must be taken in both starting approximation and the Newton iteration to avoid overflow and underflow in intermediate computation. This is easily and cheaply accomplished by scaling by suitable powers of two.
While the resulting iteration schemes deliver accurate results in general, errors of many ulps occur for arguments near zero and for arguments near -1/e ≈ -0.367879. I addressed the former issue by using the first few terms of the Taylor series expansion around zero: x - x^2 + (3/2)x^3. The fact that W0 ≈ √(1+e·x) - 1 on [-1/e, 0] suggests the use of a minimax polynomial approximation p(t) with t = √(x+1/e), which turns out to work reasonably well near -1/e. I generated this approximation with the Remez algorithm.
The accuracy achieved for both IEEE-754 binary32 mapped to float, and IEEE-754 binary64 mapped to double is well within the specified error bound. Maximum error in the positive half-plane is less than 1.5 ulps, and maximum error in the negative half-plane is below 2.7 ulps. The single-precision code was tested exhaustively, while the double-precision code was tested with one billion random arguments.
[*] Roberto Iacono and John P. Boyd. "New approximations to the principal real-valued branch of the Lambert W-function." Advances in Computational Mathematics, Vol. 43, No. 6 , December 2017, pp. 1403-1436.
The single-precision implementation of the Lambert W0 function is as follows:
float expf_scale_pos_normal (float a, int scale);
float logf_pos_normal (float a);
/*
Compute the principal branch of the Lambert W function, W_0. The maximum
error in the positive half-plane is 1.49874 ulps and the maximum error in
the negative half-plane is 2.56002 ulps
*/
float lambert_w0f (float z)
{
const float em1_fact_0 = 0.625529587f; // exp(-1)_factor_0
const float em1_fact_1 = 0.588108778f; // exp(-1)_factor_1
const float qe1 = 2.71828183f / 4.0f; // 0x1.5bf0a8p-01 // exp(1)/4
float e, w, num, den, rden, redz, y, r;
if (isnan (z) || (z == INFINITY) || (z == 0.0f)) return z + z;
if (fabsf (z) < 1.220703125e-4f) return fmaf (-z, z, z); // 0x1.0p-13
redz = fmaf (em1_fact_0, em1_fact_1, z); // z + exp(-1)
if (redz < 0.0625f) { // expansion at -(exp(-1))
r = sqrtf (redz);
w = -1.23046875f; // -0x1.3b0000p+0
w = fmaf (w, r, 2.17185670f); // 0x1.15ff66p+1
w = fmaf (w, r, -2.19554094f); // -0x1.19077cp+1
w = fmaf (w, r, 1.92107077f); // 0x1.ebcb4cp+0
w = fmaf (w, r, -1.81141856f); // -0x1.cfb920p+0
w = fmaf (w, r, 2.33162979f); // 0x1.2a72d8p+1
w = fmaf (w, r, -1.00000000f); // -0x1.000000p+0
} else {
/* Compute initial approximation. Based on: Roberto Iacono and John
Philip Boyd, "New approximations to the principal real-valued branch
of the Lambert W function", Advances in Computational Mathematics,
Vol. 43, No. 6, December 2017, pp. 1403-1436
*/
y = fmaf (2.0f, sqrtf (fmaf (qe1, z, 0.25f)), 1.0f);
y = logf_pos_normal (fmaf (1.15262585f, y, -0.15262585f) /
fmaf (0.45906518f, logf_pos_normal (y), 1.0f));
w = fmaf (2.0390625f, y, -1.0f);
/* perform Newton iterations to refine approximation to full accuracy */
for (int i = 0; i < 3; i++) {
e = expf_scale_pos_normal (w, -3); // 0.125f * expf (w);
num = fmaf (w, e, -0.125f * z);
den = fmaf (w, e, e);
rden = 1.0f / den;
w = fmaf (-num, rden, w);
}
}
return w;
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
/* exp(a) * 2**scale; positive normal results only! Maximum error 0.86565 ulp */
float expf_scale_pos_normal (float a, int scale)
{
const float flt_int_cvt = 12582912.0f; // 0x1.8p23
float f, r, j, t;
uint32_t i;
/* exp(a) = 2**i * exp(f); i = rintf (a / log(2)) */
j = fmaf (1.442695f, a, flt_int_cvt); // // 0x1.715476p0 // log2(e)
t = j - flt_int_cvt;
f = fmaf (t, -6.93145752e-1f, a); // -0x1.62e400p-1 // log_2_hi
f = fmaf (t, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo
i = float_as_uint32 (j);
/* approximate r = exp(f) on interval [-log(2)/2, +log(2)/2] */
r = 1.37805939e-3f; // 0x1.694000p-10
r = fmaf (r, f, 8.37312452e-3f); // 0x1.125edcp-7
r = fmaf (r, f, 4.16695364e-2f); // 0x1.555b5ap-5
r = fmaf (r, f, 1.66664720e-1f); // 0x1.555450p-3
r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
/* exp(a) = 2**(i+scale) * r; */
r = uint32_as_float (((i + scale) << 23) + float_as_uint32 (r));
return r;
}
/* compute natural logarithm of positive normals; maximum error: 0.85089 ulp */
float logf_pos_normal (float a)
{
const float ln2 = 0.693147182f; // 0x1.62e430p-1 // log(2)
const float two_to_m23 = 1.19209290e-7f; // 0x1.0p-23
float m, r, s, t, i, f;
int32_t e;
/* log(a) = log(m * 2**i) = i * log(2) + log(m) */
e = (float_as_uint32 (a) - float_as_uint32 (0.666666667f)) & 0xff800000;
m = uint32_as_float (float_as_uint32 (a) - e);
i = (float)e * two_to_m23;
/* log(m) = log1p(f) */
f = m - 1.0f;
s = f * f;
/* compute log1p(f) for f in [-1/3, 1/3] */
r = -0.130310059f; // -0x1.0ae000p-3
t = 0.140869141f; // 0x1.208000p-3
r = fmaf (r, s, -0.121483363f); // -0x1.f1988ap-4
t = fmaf (t, s, 0.139814854f); // 0x1.1e5740p-3
r = fmaf (r, s, -0.166846141f); // -0x1.55b36ep-3
t = fmaf (t, s, 0.200120345f); // 0x1.99d8b2p-3
r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
r = fmaf (t, f, r);
r = fmaf (r, f, 0.333331972f); // 0x1.5554fap-2
r = fmaf (r, f, -0.500000000f); // -0x1.000000p-1
r = fmaf (r, s, f);
/* log(a) = i * log(2) + log(m) */
r = fmaf (i, ln2, r);
return r;
}
The double-precision implementation is structurally equivalent to the single-precision implementation, except that it makes use of the Iacono-Boyd iteration scheme:
double exp_scale_pos_normal (double a, int scale);
double log_pos_normal (double a);
/* Compute the principal branch of the Lambert W function, W_0. Maximum error:
positive half-plane: 1.49210 ulp
negative half-plane: 2.67824 ulp
*/
double lambert_w0 (double z)
{
const double em1_fact_0 = 0.57086272525975246; // 0x1.24481e7efdfcep-1 // exp(-1)_factor_0
const double em1_fact_1 = 0.64442715366299452; // 0x1.49f25b1b461b7p-1 // exp(-1)_factor_1
const double qe1 = 2.7182818284590452 * 0.25; // 0x1.5bf0a8b145769p-1 // exp(1)/4
double e, r, t, w, y, num, den, rden, redz;
int i;
if (isnan (z) || (z == INFINITY) || (z == 0.0)) return z + z;
if (fabs (z) < 1.9073486328125e-6) return fma (fma (1.5, z, -1.) * z, z, z);
redz = fma (em1_fact_0, em1_fact_1, z); // z + exp(-1)
if (redz < 0.01025390625) { // expansion at -(exp(-1))
r = sqrt (redz);
w = -7.8466654751155138; // -0x1.f62fc463917ffp+2
w = fma (w, r, 10.0241581340373877); // 0x1.40c5e74773ef5p+3
w = fma (w, r, -8.1029379749359691); // -0x1.034b44947bba0p+3
w = fma (w, r, 5.8322883145113726); // 0x1.75443634ead5fp+2
w = fma (w, r, -4.1738796362609882); // -0x1.0b20d80dcb9acp+2
w = fma (w, r, 3.0668053943936471); // 0x1.888d14440efd0p+1
w = fma (w, r, -2.3535499689514934); // -0x1.2d41201913016p+1
w = fma (w, r, 1.9366310979331112); // 0x1.efc70e3e0a0eap+0
w = fma (w, r, -1.8121878855270763); // -0x1.cfeb8b968bd2cp+0
w = fma (w, r, 2.3316439815968506); // 0x1.2a734f5b6fd56p+1
w = fma (w, r, -1.0000000000000000); // -0x1.0000000000000p+0
return w;
}
/* Roberto Iacono and John Philip Boyd, "New approximations to the
principal real-valued branch of the Lambert W function", Advances
in Computational Mathematics, Vol. 43, No. 6, December 2017,
pp. 1403-1436
*/
y = fma (2.0, sqrt (fma (qe1, z, 0.25)), 1.0);
y = log_pos_normal (fma (1.14956131, y, -0.14956131) /
fma (0.4549574, log_pos_normal (y), 1.0));
w = fma (2.036, y, -1.0);
/* Use iteration scheme w = (w / (1 + w)) * (1 + log (z / w) from
Roberto Iacono and John Philip Boyd, "New approximations to the
principal real-valued branch of the Lambert W function", Advances
in Computational Mathematics, Vol. 43, No. 6, December 2017, pp.
1403-1436
*/
for (i = 0; i < 3; i++) {
t = w / (1.0 + w);
w = fma (log_pos_normal (z / w), t, t);
}
/* Fine tune approximation with a single Newton iteration */
e = exp_scale_pos_normal (w, -3); // 0.125 * exp (w)
num = fma (w, e, -0.125 *z);
den = fma (w, e, e);
rden = 1.0 / den;
w = fma (-num, rden, w);
return w;
}
int double2hiint (double a)
{
unsigned long long int t;
memcpy (&t, &a, sizeof t);
return (int)(t >> 32);
}
int double2loint (double a)
{
unsigned long long int t;
memcpy (&t, &a, sizeof t);
return (int)(unsigned int)t;
}
double hiloint2double (int hi, int lo)
{
double r;
unsigned long long int t;
t = ((unsigned long long int)(unsigned int)hi << 32) | (unsigned int)lo;
memcpy (&r, &t, sizeof r);
return r;
}
/* exp(a) * 2**scale; pos. normal results only! Max. err. found: 0.89028 ulp */
double exp_scale_pos_normal (double a, int scale)
{
const double ln2_hi = 6.9314718055829871e-01; // 0x1.62e42fefa00000p-01
const double ln2_lo = 1.6465949582897082e-12; // 0x1.cf79abc9e3b3a0p-40
const double l2e = 1.4426950408889634; // 0x1.71547652b82fe0p+00 // log2(e)
const double cvt = 6755399441055744.0; // 0x1.80000000000000p+52 // 3*2**51
double f, r;
int i;
/* exp(a) = exp(i + f); i = rint (a / log(2)) */
r = fma (l2e, a, cvt);
i = double2loint (r);
r = r - cvt;
f = fma (r, -ln2_hi, a);
f = fma (r, -ln2_lo, f);
/* approximate r = exp(f) on interval [-log(2)/2,+log(2)/2] */
r = 2.5022018235176802e-8; // 0x1.ade0000000000p-26
r = fma (r, f, 2.7630903497145818e-7); // 0x1.28af3fcbbf09bp-22
r = fma (r, f, 2.7557514543490574e-6); // 0x1.71dee623774fap-19
r = fma (r, f, 2.4801491039409158e-5); // 0x1.a01997c8b50d7p-16
r = fma (r, f, 1.9841269589068419e-4); // 0x1.a01a01475db8cp-13
r = fma (r, f, 1.3888888945916566e-3); // 0x1.6c16c1852b805p-10
r = fma (r, f, 8.3333333334557735e-3); // 0x1.11111111224c7p-7
r = fma (r, f, 4.1666666666519782e-2); // 0x1.55555555502a5p-5
r = fma (r, f, 1.6666666666666477e-1); // 0x1.5555555555511p-3
r = fma (r, f, 5.0000000000000122e-1); // 0x1.000000000000bp-1
r = fma (r, f, 1.0000000000000000e+0); // 0x1.0000000000000p+0
r = fma (r, f, 1.0000000000000000e+0); // 0x1.0000000000000p+0
/* exp(a) = 2**(i+scale) * r */
r = hiloint2double (double2hiint (r) + ((i + scale) << 20),
double2loint (r));
return r;
}
/* compute natural logarithm of positive normals; max. err. found: 0.86902 ulp*/
double log_pos_normal (double a)
{
const double ln2_hi = 6.9314718055994529e-01; // 0x1.62e42fefa39efp-01
const double ln2_lo = 2.3190468138462996e-17; // 0x1.abc9e3b39803fp-56
double m, r, i, s, t, p, q;
int e;
/* log(a) = log(m * 2**i) = i * log(2) + log(m) */
e = (double2hiint (a) - double2hiint (0.70703125)) & 0xfff00000;
m = hiloint2double (double2hiint (a) - e, double2loint (a));
t = hiloint2double (0x41f00000, 0x80000000 ^ e);
i = t - (hiloint2double (0x41f00000, 0x80000000));
/* m now in [181/256, 362/256]. Compute q = (m-1) / (m+1) */
p = m + 1.0;
r = 1.0 / p;
q = fma (m, r, -r);
m = m - 1.0;
/* compute (2*atanh(q)/q-2*q) as p(q**2), q in [-75/437, 53/309] */
s = q * q;
r = 1.4794533702196025e-1; // 0x1.2efdf700d7135p-3
r = fma (r, s, 1.5314187748152339e-1); // 0x1.39a272db730f7p-3
r = fma (r, s, 1.8183559141306990e-1); // 0x1.746637f2f191bp-3
r = fma (r, s, 2.2222198669309609e-1); // 0x1.c71c522a64577p-3
r = fma (r, s, 2.8571428741489319e-1); // 0x1.24924941c9a2fp-2
r = fma (r, s, 3.9999999999418523e-1); // 0x1.999999998006cp-2
r = fma (r, s, 6.6666666666667340e-1); // 0x1.5555555555592p-1
r = r * s;
/* log(a) = 2*atanh(q) + i*log(2) = ln2_lo*i + p(q**2)*q + 2q + ln2_hi * i.
Use K.C. Ng's trick to improve the accuracy of the computation, like so:
p(q**2)*q + 2q = p(q**2)*q + q*t - t + m, where t = m**2/2.
*/
t = m * m * 0.5;
r = fma (q, t, fma (q, r, ln2_lo * i)) - t + m;
r = fma (ln2_hi, i, r);
return r;
}

IEEE 754 conformant sqrtf() implementation taking into account hardware restrictions and usage limitations

Follow-up question for IEEE 754 conformant sqrt() implementation for double type.
Context: Need to implement IEEE 754 conformant sqrtf() taking into account the following HW restrictions and usage limitations:
Provides a special instruction qseed.f to get an approximation of the reciprocal of the square root (the accuracy of the result is no less than 6.75 bits, and therefore always within ±1% of the accurate result).
Single precision FP:
a. Support by HW (SP FPU): has support;
b. Support by SW (library): has support;
c. Support of subnormal numbers: no support (FLT_HAS_SUBNORM is 0).
Double precision FP:
a. Support by HW (DP FPU): no support;
b. Support by SW (library): has support;
c. Support of subnormal numbers: no support (DBL_HAS_SUBNORM is 0).
I've found one presentation by John Harrison and ended up with this implementation (note that here qseed.f is replaced by rsqrtf()):
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
// https://github.com/nickzman/hyperspace/blob/master/frsqrt.hh
#if 1
float rsqrtf ( float x )
{
const float xhalf = 0.5f * x;
int i = *(int*) & x;
i = 0x5f375a86 - ( i >> 1 );
x = *(float*) & i;
x = x * ( 1.5f - xhalf * x * x );
x = x * ( 1.5f - xhalf * x * x );
x = x * ( 1.5f - xhalf * x * x );
return x;
}
#else
float rsqrtf ( float x )
{
return 1.0f / sqrtf( x );
}
#endif
float sqrtfr_jh( float x, float r )
{
/*
* John Harrison, Formal Verification Methods 5: Floating Point Verification,
* Intel Corporation, 12 December 2002, document name: slides5.pdf, page 14,
* slide "The square root algorithm".
* URL: https://www.cl.cam.ac.uk/~jrh13/slides/anu-09_12dec02/slides5.pdf
*/
double rd, b, z0, s0, d, k, h0, e, t0, s1, c, d1, h1, s;
static const double half = 0.5;
static const double one = 1.0;
static const double three = 3.0;
static const double two = 2.0;
rd = (double)r;
b = half * x;
z0 = rd * rd;
s0 = x * rd;
d = fma( -b, z0, half );
k = fma( x, rd, -s0 );
h0 = half * rd;
e = fma( three / two, d, one );
t0 = fma( d, s0, k );
s1 = fma( e, t0, s0 );
c = fma( d, e, one );
d1 = fma( -s1, s1, x );
h1 = c * h0;
s = fma( d1, h1, s1 );
return (float)s;
}
float my_sqrtf( float x )
{
/* handle special cases */
if (x == 0) {
return x + x;
}
/* handle normal cases */
if ((x > 0) && (x < INFINITY)) {
return sqrtfr_jh( x, rsqrtf( x ) );
}
/* handle special cases */
return (x < 0) ? NAN : (x + x);
}
/*
https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
From: geo <gmars...#gmail.com>
Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
Subject: 64-bit KISS RNGs
Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)
This 64-bit KISS RNG has three components, each nearly
good enough to serve alone. The components are:
Multiply-With-Carry (MWC), period (2^121+2^63-1)
Xorshift (XSH), period 2^64-1
Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64 (kiss64_t = (kiss64_x << 58) + kiss64_c, \
kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64 (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
kiss64_y ^= (kiss64_y << 43))
#define CNG64 (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)
int main (void)
{
const uint64_t N = 10000000000ULL; /* desired number of test cases */
float arg, ref, res;
uint64_t argi64;
uint32_t refi, resi;
uint64_t count = 0;
float spec[] = {0.0f, 1.0f, INFINITY, NAN};
printf ("test a few special cases:\n");
for (int i = 0; i < sizeof (spec)/sizeof(spec[0]); i++) {
printf ("my_sqrt(%a) = %a\n", spec[i], my_sqrtf(spec[i]));
printf ("my_sqrt(%a) = %a\n", -spec[i], my_sqrtf(-spec[i]));
}
printf ("test %lu random cases:\n", N);
do {
argi64 = KISS64;
memcpy (&arg, &argi64, sizeof arg);
if ( fpclassify(arg) == FP_SUBNORMAL )
{
continue;
}
++count;
res = my_sqrtf (arg);
ref = sqrtf (arg);
memcpy (&resi, &res, sizeof resi);
memcpy (&refi, &ref, sizeof refi);
if ( ! ( isnan(res) && isnan(ref) ) )
if (resi != refi) {
printf ("\rerror # arg=%a (%e)\n", arg, arg);
printf ("\rerror # res=%a (%e)\n", res, res);
printf ("\rerror # ref=%a (%e)\n", ref, ref);
return EXIT_FAILURE;
}
if ((count & 0xfffff) == 0) printf ("\r[%lu]", count);
} while (count < N);
printf ("\r[%lu]", count);
printf ("\ntests PASSED\n");
return EXIT_SUCCESS;
}
And it seems to work correctly (at least for some random cases): it reports:
[10000000000]
tests PASSED
Now the question: since the original John Harrison sqrtf() algorithm uses only single-precision computations (i.e. type float), is it possible to reduce the number of operations when using only double-precision computations (i.e. type double, except for the conversions) and still be IEEE 754 conformant?
P.S. Since users #njuffa and #chux - Reinstate Monica are strong in FP, I invite them to participate. However, all users competent in FP are welcome.
Computing a single-precision square root via double-precision code is going to be inefficient, especially if the hardware provides no native double-precision operations.
The following assumes hardware that conforms to IEEE-754 (2008), except that subnormals are not supported and flushed to zero. Fused-multiply add (FMA) is supported. It further assumes an ISO-C99 compiler that maps float to IEEE-754 binary32, and that maps the hardware's single-precision FMA instruction to the standard math function fmaf().
From a hardware starting approximation for the reciprocal square root with a maximum relative error of 2^-6.75, one can get to a reciprocal square root accurate to 1 single-precision ulp with two Newton-Raphson iterations. Multiplying this with the original argument provides an accurate estimate of the square root. The square of this approximation is subtracted from the original argument to compute the approximation error of the square root. This error is then used to apply a correction to the square root approximation, resulting in a correctly-rounded square root.
However, this straightforward algorithm breaks down for arguments that are very small, due to underflow or overflow in intermediate computations, in particular when the underlying arithmetic operates in flush-to-zero mode, which flushes subnormals to zero. For such arguments we can construct slowpath code that scales the input towards unity and scales the result back correspondingly once the square root has been computed. Code for handling special operands such as zeros, infinities, NaNs, and negative arguments other than zero is also added to this slowpath code.
The NaN generated by the slowpath code for invalid operations should be adjusted to match the NaN conventions of the system's other floating-point operations. For example, for x86-based systems this would be the special QNaN called INDEFINITE, with a bit pattern of 0xffc00000, while for a GPU running CUDA it would be the canonical single-precision NaN with a bit pattern of 0x7fffffff.
For performance reasons it may be useful to inline the fastpath code while making the slowpath code an outlined, called subroutine. Single-precision math functions with a single argument should always be tested exhaustively against a "golden" reference implementation, which takes just minutes on modern hardware.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float uint32_as_float (uint32_t);
uint32_t float_as_uint32 (float);
float qseedf (float);
float sqrtf_slowpath (float);
/* Square root computation for IEEE-754 binary32 mapped to 'float' */
float my_sqrtf (float arg)
{
const uint32_t upper = float_as_uint32 (0x1.fffffep+127f);
const uint32_t lower = float_as_uint32 (0x1.000000p-102f);
float rsq, sqt, err;
/* use fastpath computation if argument in [0x1.0p-102, 0x1.0p+128) */
if ((float_as_uint32 (arg) - lower) <= (upper - lower)) {
/* generate low-accuracy approximation to rsqrt(arg) */
rsq = qseedf (arg);
/* apply two Newton-Raphson iterations with quadratic convergence */
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
/* compute sqrt from rsqrt, round result to nearest or even */
sqt = rsq * arg;
err = fmaf (sqt, -sqt, arg);
sqt = fmaf (0.5f * rsq, err, sqt);
} else {
sqt = sqrtf_slowpath (arg);
}
return sqt;
}
/* reinterpret bit pattern of 32-bit unsigned integer as IEEE-754 binary32 */
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
/* reinterpret bit pattern of IEEE-754 binary32 as a 32-bit unsigned integer */
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
/* simulate low-accuracy hardware approximation to 1/sqrt(a) */
float qseedf (float a)
{
float r = 1.0f / sqrtf (a);
r = uint32_as_float (float_as_uint32 (r) & ~0x1ffff);
return r;
}
/* square root computation suitable for all IEEE-754 binary32 arguments */
float sqrtf_slowpath (float arg)
{
const float FP32_INFINITY = uint32_as_float (0x7f800000);
const float FP32_QNAN = uint32_as_float (0xffc00000); /* system specific */
const float scale_in = 0x1.0p+26f;
const float scale_out = 0x1.0p-13f;
float rsq, err, sqt;
if (arg < 0.0f) {
return FP32_QNAN;
} else if ((arg == 0.0f) || !(fabsf (arg) < FP32_INFINITY)) { /* Inf, NaN */
return arg + arg;
} else {
/* scale subnormal arguments towards unity */
arg = arg * scale_in;
/* generate low-accuracy approximation to rsqrt(arg) */
rsq = qseedf (arg);
/* apply two Newton-Raphson iterations with quadratic convergence */
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
rsq = fmaf (fmaf (-0.5f * arg * rsq, rsq, 0.5f), rsq, rsq);
/* compute sqrt from rsqrt, round to nearest or even */
sqt = rsq * arg;
err = fmaf (sqt, -sqt, arg);
sqt = fmaf (0.5f * rsq, err, sqt);
/* compensate scaling of argument by counter-scaling the result */
sqt = sqt * scale_out;
return sqt;
}
}
int main (void)
{
uint32_t ai, resi, refi;
float a, res, reff;
double ref;
ai = 0x00000000;
do {
a = uint32_as_float (ai);
res = my_sqrtf (a);
ref = sqrt ((double)a);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
if (resi != refi) {
printf ("error # %08x %15.8e res=%08x %15.8e ref=%08x %15.8e\n",
ai, a, resi, res, refi, reff);
return EXIT_FAILURE;
}
ai++;
} while (ai);
return EXIT_SUCCESS;
}

Single precision argument reduction for trigonometric functions in C

I have implemented some approximations for trigonometric functions (sin,cos,arctan) computed with single precision (32 bit floating point) in C. They are accurate to about +/- 2 ulp.
My target device does not support any <cmath> or <math.h> methods. It does not provide an FMA, but a MAC ALU. ALU and LU compute in 32-bit format.
My arctan approximation is actually a modified version of the approximation of N.juffa, which approximates arctan on the full range. The sine and cosine functions are accurate up to 2 ulp within the range [-pi,pi].
I am now aiming to provide a larger input range (as large as possible, ideally [FLT_MIN,FLT_MAX]) for sine and cosine, which leads me to argument reduction.
I'm currently reading different papers like "ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit" by K. C. Ng, or the paper about this new argument reduction algorithm, but I wasn't able to derive an implementation from them.
Also I want to mention two Stack Overflow questions that refer to related problems: there is an approach with MATLAB and C++ which is based on the first paper I linked. It actually uses MATLAB and cmath methods, and it limits the input to [0, 20000]. The other one was already mentioned in the comments. It is an approach to an implementation of sin and cos in C, using various C libraries which are not available to me. Since both posts are already several years old, there might be some new findings.
It seems like the algorithm mostly used in this case is to store the value of 2/pi accurate to the needed number of bits, in order to compute the modulo calculation accurately and simultaneously avoid cancellation. My device does not provide a large DMEM, which means large look-up tables with hundreds of bits are not possible. This procedure is actually described on page 70 of this reference, which by the way provides a lot of useful information about floating-point math.
So my question is: is there another efficient way to reduce the arguments for sine and cosine that achieves single-precision accuracy while avoiding large LUTs? The papers mentioned above actually focus on double precision and use up to 1000 digits, which is not suitable for my use case.
I actually haven't found any implementation in C, nor an implementation aiming at single-precision calculation; I would be grateful for any sort of hints/links/examples...
The following code is based on a previous answer in which I demonstrated how to perform a fairly accurate argument reduction for trigonometric functions by using the Cody-Waite method of split constants for arguments small in magnitude, and the Payne-Hanek method for arguments large in magnitude. For details on the Payne-Hanek algorithm see there, for details on the Cody-Waite algorithm see this previous answer of mine.
Here I have made adjustments necessary to adjust to the restrictions of the asker's platform, in that no 64-bit types are supported, fused multiply-add is not supported, and helper functions from math.h are not available. I am assuming that float maps to IEEE-754 binary32 format, and that there is a way to re-interpret such a 32-bit float as a 32-bit unsigned integer and vice versa. I have implemented this re-interpretation via the standard portable idiom, that is, by using memcpy(), but other methods may be chosen appropriate for the unspecified target platform, such as inline assembly, machine-specific intrinsics, or volatile unions.
Since this code is basically a port of my previous code to a more restrictive environment, it lacks perhaps the elegance of a de novo design specifically targeted at that environment. I have basically replaced the frexp() helper function from math.h with some bit twiddling, emulated 64-bit integer computation with pairs of 32-bit integers, replaced the double-precision computation with 32-bit fixed-point computation (which worked much better than I had anticipated), and replaced all FMAs with the unfused equivalent.
Re-working the Cody-Waite portion of the argument reduction took quite a bit of work. Clearly, without FMA available, we need to ensure a sufficient number of trailing zero bits in the constituent parts of the constant π/2 (except the least significant one) to make sure the products are exact. I spent several hours experimentally puzzling out a particular split that delivers accurate results but also pushes the switchover point to the Payne-Hanek method as high as possible.
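To illustrate the trailing-zero-bit argument with a small self-check (my own snippet, not part of the code below): with the 3-stage split used below, pio2_high = 0x1.921f00p+00 carries only 17 significant bits, so its product with any integer |j| <= 128 (which, as far as I can tell, covers the quotients arising up to the switchover point of about 201) needs at most 24 significant bits and is therefore exact even without FMA:
#include <stdio.h>
int main (void)
{
    const float pio2_high = 0x1.921f00p+00f;          /* 17 significant bits */
    int inexact = 0;
    for (int j = -128; j <= 128; j++) {
        float prod = (float)j * pio2_high;             /* single-precision product */
        double exact = (double)j * (double)pio2_high;  /* exact, fits easily into 53 bits */
        if ((double)prod != exact) inexact++;
    }
    printf ("inexact products: %d\n", inexact);        /* expect 0 */
    return 0;
}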
When USE_FMA = 1 is specified, the output of the test app, when compiled with a high-quality math library, should look similar to this:
Testing sinf ... PASSED. max ulp err = 1.493253 diffsum = 337633490
Testing cosf ... PASSED. max ulp err = 1.495098 diffsum = 342020968
With USE_FMA = 0 the accuracy changes slightly for the worse:
Testing sinf ... PASSED. max ulp err = 1.498012 diffsum = 359702532
Testing cosf ... PASSED. max ulp err = 1.504061 diffsum = 364682650
The diffsum output is a rough indicator of overall accuracy: a diffsum in the neighborhood of 3.5e8 across all 2^32 possible inputs means that roughly 8 percent of results are off by 1 ulp, i.e. about 90% of all inputs result in a correctly rounded single-precision response.
Note that it is important to compile the code with the strictest floating-point settings and highest degree of adherence to IEEE-754 the compiler offers. For the Intel compiler that I used to develop and test this code, that can be achieved by compiling with /fp:strict. Also, the quality of the math library used for reference is crucial for accurate assessment of the ulp error of this single-precision code. The Intel compiler comes with a math library that provides double-precision elementary math functions with just slightly over 0.5 ulp error in the HA (high accuracy) variant. Use of a multi-precision reference library may be preferable but would have slowed me down too much here.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h> // for memcpy()
#include <math.h> // for test purposes, and when PORTABLE=1 or USE_FMA=1
#define USE_FMA (0) // use fmaf() calls for arithmetic
#define PORTABLE (0) // allow helper functions from math.h
#define HAVE_U64 (0) // 64-bit integer type available
#define CW_STAGES (3) // number of stages in Cody-Waite reduction when USE_FMA=0
#if USE_FMA
#define SIN_RED_SWITCHOVER (117435.992f)
#define COS_RED_SWITCHOVER (71476.0625f)
#define MAX_DIFF (1)
#else // USE_FMA
#if CW_STAGES == 2
#define SIN_RED_SWITCHOVER (3.921875f)
#define COS_RED_SWITCHOVER (3.921875f)
#elif CW_STAGES == 3
#define SIN_RED_SWITCHOVER (201.15625f)
#define COS_RED_SWITCHOVER (142.90625f)
#endif // CW_STAGES
#define MAX_DIFF (2)
#endif // USE_FMA
/* re-interpret the bit pattern of an IEEE-754 float as a uint32 */
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
/* re-interpret the bit pattern of a uint32 as an IEEE-754 float */
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
/* Compute the upper 32 bits of the product of two unsigned 32-bit integers */
#if HAVE_U64
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
return (uint32_t)(((uint64_t)a * b) >> 32);
}
#else // HAVE_U64
/* Henry S. Warren, "Hacker's Delight, 2nd ed.", Addison-Wesley 2012. Fig. 8-2 */
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
uint16_t a_lo = (uint16_t)a;
uint16_t a_hi = a >> 16;
uint16_t b_lo = (uint16_t)b;
uint16_t b_hi = b >> 16;
uint32_t p0 = (uint32_t)a_lo * b_lo;
uint32_t p1 = (uint32_t)a_lo * b_hi;
uint32_t p2 = (uint32_t)a_hi * b_lo;
uint32_t p3 = (uint32_t)a_hi * b_hi;
uint32_t t = (p0 >> 16) + p1;
return (t >> 16) + (((uint32_t)(uint16_t)t + p2) >> 16) + p3;
}
#endif // HAVE_U64
/* 190 bits of 2/PI for Payne-Hanek style argument reduction. */
const uint32_t two_over_pi_f [] =
{
0x28be60db,
0x9391054a,
0x7f09d5f4,
0x7d4d3770,
0x36d8a566,
0x4f10e410
};
/* Reduce a trig function argument using the slow Payne-Hanek method */
float trig_red_slowpath_f (float a, int *quadrant)
{
uint32_t ia, hi, mid, lo, tmp, i, l, h, plo, phi;
int32_t e, q;
float r;
#if PORTABLE
ia = (uint32_t)(fabsf (frexpf (a, &e)) * 4.29496730e+9f); // 0x1.0p32
#else // PORTABLE
ia = ((float_as_uint32 (a) & 0x007fffff) << 8) | 0x80000000;
e = ((float_as_uint32 (a) >> 23) & 0xff) - 126;
#endif // PORTABLE
/* compute product x * 2/pi in 2.62 fixed-point format */
i = (uint32_t)e >> 5;
e = (uint32_t)e & 31;
hi = i ? two_over_pi_f [i-1] : 0;
mid = two_over_pi_f [i+0];
lo = two_over_pi_f [i+1];
tmp = two_over_pi_f [i+2];
if (e) {
hi = (hi << e) | (mid >> (32 - e));
mid = (mid << e) | (lo >> (32 - e));
lo = (lo << e) | (tmp >> (32 - e));
}
/* compute 64-bit product phi:plo */
phi = 0;
l = ia * lo;
h = umul32_hi (ia, lo);
plo = phi + l;
phi = h + (plo < l);
l = ia * mid;
h = umul32_hi (ia, mid);
plo = phi + l;
phi = h + (plo < l);
l = ia * hi;
phi = phi + l;
/* split fixed-point result into integer and fraction portions */
q = phi >> 30; // integral portion = quadrant<1:0>
phi = phi & 0x3fffffff; // fraction
if (phi & 0x20000000) { // fraction >= 0.5
phi = phi - 0x40000000; // fraction - 1.0
q = q + 1;
}
/* compute remainder of x / (pi/2) */
#if USE_FMA
float phif, plof, chif, clof, thif, tlof;
phif = 1.34217728e+8f * (float)(int32_t)(phi & 0xffffffe0); // 0x1.0p27
plof = (float)((plo >> 5) | (phi << (32-5)));
thif = phif + plof;
plof = (phif - thif) + plof;
phif = thif;
chif = 1.08995894e-17f; // 0x1.921fb6p-57 // (1.5707963267948966 * 0x1.0p-57)_hi
clof = -3.03308686e-25f; // -0x1.777a5cp-82 // (1.5707963267948966 * 0x1.0p-57)_lo
thif = phif * chif;
tlof = fmaf (phif, chif, -thif);
tlof = fmaf (phif, clof, tlof);
tlof = fmaf (plof, chif, tlof);
r = thif + tlof;
#else // USE_FMA
/* record sign of fraction */
uint32_t s = phi & 0x80000000;
/* take absolute value of fraction */
if ((int32_t)phi < 0) {
phi = ~phi;
plo = 0 - plo;
phi += (plo == 0);
}
/* normalize fraction */
e = 0;
while ((int32_t)phi > 0) {
phi = (phi << 1) | (plo >> 31);
plo = plo << 1;
e--;
}
/* multiply 32 high-order bits of fraction with pi/2 */
phi = umul32_hi (phi, 0xc90fdaa2); // (uint32_t)rint(PI/2 * 2**31)
/* normalize product */
if ((int32_t)phi > 0) {
phi = phi << 1;
e--;
}
/* round and convert to floating point */
uint32_t ri = s + ((e + 128) << 23) + (phi >> 8) + ((phi & 0xff) > 0x7e);
r = uint32_as_float (ri);
#endif // USE_FMA
if (a < 0.0f) {
r = -r;
q = -q;
}
*quadrant = q;
return r;
}
/* Argument reduction for trigonometric functions that reduces the argument
to the interval [-PI/4, +PI/4] and also returns the quadrant. It returns
-0.0f for an input of -0.0f
*/
float trig_red_f (float a, float switch_over, int *q)
{
float j, r;
if (fabsf (a) > switch_over) {
/* Payne-Hanek style reduction. M. Payne and R. Hanek, "Radian reduction
for trigonometric functions". SIGNUM Newsletter, 18:19-24, 1983
*/
r = trig_red_slowpath_f (a, q);
} else {
/* Cody-Waite style reduction. W. J. Cody and W. Waite, "Software Manual
for the Elementary Functions", Prentice-Hall 1980
*/
#if USE_FMA
j = fmaf (a, 6.36619747e-1f, 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
j = j - 1.25829120e+7f; // 0x1.8p+23
r = fmaf (j, -1.57079601e+00f, a); // -0x1.921fb0p+00 // pio2_high
r = fmaf (j, -3.13916473e-07f, r); // -0x1.5110b4p-22 // pio2_mid
r = fmaf (j, -5.39030253e-15f, r); // -0x1.846988p-48 // pio2_low
#else // USE_FMA
j = (a * 6.36619747e-1f + 1.2582912e+7f); // 0x1.45f306p-1, 0x1.8p+23
j = j - 1.25829120e+7f; // 0x1.8p+23
#if CW_STAGES == 2
r = a - j * 1.57079625e+00f; // 0x1.921fb4p+0 // pio2_high
r = r - j * 7.54979013e-08f; // 0x1.4442d2p-24 // pio2_low
#elif CW_STAGES == 3
r = a - j * 1.57078552e+00f; // 0x1.921f00p+00 // pio2_high
r = r - j * 1.08043314e-05f; // 0x1.6a8880p-17 // pio2_mid
r = r - j * 2.56334407e-12f; // 0x1.68c234p-39 // pio2_low
#endif // CW_STAGES
#endif // USE_FMA
*q = (int)j;
}
return r;
}
/* Approximate sine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 0.64196
Returns -0.0f for an argument of -0.0f
Polynomial approximation based on T. Myklebust, "Computing accurate
Horner form approximations to special functions in finite precision
arithmetic", http://arxiv.org/abs/1508.03211, retrieved on 8/29/2016
*/
float sinf_poly (float a, float s)
{
float r, t;
#if USE_FMA
r = 2.86567956e-6f; // 0x1.80a000p-19
r = fmaf (r, s, -1.98559923e-4f); // -0x1.a0690cp-13
r = fmaf (r, s, 8.33338592e-3f); // 0x1.111182p-07
r = fmaf (r, s, -1.66666672e-1f); // -0x1.555556p-03
t = fmaf (a, s, 0.0f); // ensure -0 is passed through
r = fmaf (r, t, a);
#else // USE_FMA
r = 2.86567956e-6f; // 0x1.80a000p-19
r = r * s - 1.98559923e-4f; // -0x1.a0690cp-13
r = r * s + 8.33338592e-3f; // 0x1.111182p-07
r = r * s - 1.66666672e-1f; // -0x1.555556p-03
t = a * s + 0.0f; // ensure -0 is passed through
r = r * t + a;
#endif // USE_FMA
return r;
}
/* Approximate cosine on [-PI/4,+PI/4]. Maximum ulp error with USE_FMA = 0.87444 */
float cosf_poly (float s)
{
float r;
#if USE_FMA
r = 2.44677067e-5f; // 0x1.9a8000p-16
r = fmaf (r, s, -1.38877297e-3f); // -0x1.6c0efap-10
r = fmaf (r, s, 4.16666567e-2f); // 0x1.555550p-05
r = fmaf (r, s, -5.00000000e-1f); // -0x1.000000p-01
r = fmaf (r, s, 1.00000000e+0f); // 0x1.000000p+00
#else // USE_FMA
r = 2.44677067e-5f; // 0x1.9a8000p-16
r = r * s - 1.38877297e-3f; // -0x1.6c0efap-10
r = r * s + 4.16666567e-2f; // 0x1.555550p-05
r = r * s - 5.00000000e-1f; // -0x1.000000p-01
r = r * s + 1.00000000e+0f; // 0x1.000000p+00
#endif // USE_FMA
return r;
}
/* Map sine or cosine value based on quadrant */
float sinf_cosf_core (float a, int i)
{
float r, s;
s = a * a;
r = (i & 1) ? cosf_poly (s) : sinf_poly (a, s);
if (i & 2) {
r = 0.0f - r; // don't change "sign" of NaNs
}
return r;
}
/* maximum ulp error with USE_FMA = 1: 1.495098 */
float my_sinf (float a)
{
float r;
int i;
a = a * 0.0f + a; // inf -> NaN
r = trig_red_f (a, SIN_RED_SWITCHOVER, &i);
r = sinf_cosf_core (r, i);
return r;
}
/* maximum ulp error with USE_FMA = 1: 1.493253 */
float my_cosf (float a)
{
float r;
int i;
a = a * 0.0f + a; // inf -> NaN
r = trig_red_f (a, COS_RED_SWITCHOVER, &i);
r = sinf_cosf_core (r, i + 1);
return r;
}
/* re-interpret bit pattern of an IEEE-754 double as a uint64 */
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float type's limited
exponent range.
*/
refi = double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
int main (void)
{
float arg, res, reff;
uint32_t argi, resi, refi;
int64_t diff, diffsum;
double ref, ulp, maxulp;
printf ("Testing sinf ... ");
diffsum = 0;
maxulp = 0;
argi = 0;
do {
arg = uint32_as_float (argi);
res = my_sinf (arg);
ref = sin ((double)arg);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) {
maxulp = ulp;
}
diff = (resi > refi) ? (resi - refi) : (refi - resi);
if (diff > MAX_DIFF) {
printf ("\nerror # %08x (% 15.8e): res=%08x (% 15.8e) ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
return EXIT_FAILURE;
}
diffsum = diffsum + diff;
argi++;
} while (argi);
printf ("PASSED. max ulp err = %.6f diffsum = %lld\n", maxulp, diffsum);
printf ("Testing cosf ... ");
diffsum = 0;
maxulp = 0;
argi = 0;
do {
arg = uint32_as_float (argi);
res = my_cosf (arg);
ref = cos ((double)arg);
reff = (float)ref;
resi = float_as_uint32 (res);
refi = float_as_uint32 (reff);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) {
maxulp = ulp;
}
diff = (resi > refi) ? (resi - refi) : (refi - resi);
if (diff > MAX_DIFF) {
printf ("\nerror # %08x (% 15.8e): res=%08x (% 15.8e) ref=%08x (%15.8e)\n", argi, arg, resi, res, refi, reff);
return EXIT_FAILURE;
}
diffsum = diffsum + diff;
argi++;
} while (argi);
printf ("PASSED. max ulp err = %.6f diffsum = %lld\n", maxulp, diffsum);
return EXIT_SUCCESS;
}
There's a thread on the Mathematics forum where user J. M. ain't a mathematician introduced an improved Taylor/Padé idea to approximate the cos and sin functions in the range [-pi,pi]. Here's the sine version translated to C++. This approximation is not as fast as the library std::sin() function, but it might be worth checking whether an SSE/AVX/FMA implementation helps enough with the speed.
I have not tested the ULP error against the library sin() or cos() functions, but according to the Julia Function Accuracy Test tool it looks like an excellent approximation method (add the code below to the runtest.jl module which belongs to the Julia test suite):
function test_sine(x::AbstractFloat)
f=0.5
z=x*0.5
k=0
while (abs(z)>f)
z*=0.5
k=k+1
end
z2=z^2;
r=z*(1+(z2/105-1)*((z/3)^2))/
(1+(z2/7-4)*((z/3)^2));
while(k > 0)
r = (2*r)/(1-r*r);
k=k-1
end
return (2*r)/(1+r*r)
end
function test_cosine(x::AbstractFloat)
f=0.5
z=x*0.5
k=0
while (abs(z)>f)
z*=0.5
k=k+1
end
z2=z^2;
r=z*(1+(z2/105-1)*((z/3)^2))/
(1+(z2/7-4)*((z/3)^2));
while (k > 0)
r = (2*r)/(1-r*r);
k=k-1
end
return (1-r*r)/(1+r*r)
end
pii = 3.141592653589793238462643383279502884
MAX_SIN(n::Val{pii}, ::Type{Float16}) = 3.1415926535897932f0
MAX_SIN(n::Val{pii}, ::Type{Float32}) = 3.1415926535897932f0
#MAX_SIN(n::Val{pii}, ::Type{Float64}) = 3.141592653589793238462643383279502884
MIN_SIN(n::Val{pii}, ::Type{Float16}) = -3.1415926535897932f0
MIN_SIN(n::Val{pii}, ::Type{Float32}) = -3.1415926535897932f0
#MIN_SIN(n::Val{pii}, ::Type{Float64}) = -3.141592653589793238462643383279502884
for (func, base) in (sin=>Val(pii), test_sine=>Val(pii), cos=>Val(pii), test_cosine=>Val(pii))
for T in (Float16, Float32)
xx = range(MIN_SIN(base,T), MAX_SIN(base,T), length = 10^6);
test_acc(func, xx)
end
end
Results for the approximations and for sin() and cos() in the range [-pi,pi]:
Tol debug failed 0.0% of the time.
sin
ULP max 0.5008857846260071 at x = 2.203355
ULP mean 0.24990503381476237
Test Summary: | Pass Total
Float32 sin | 1 1
Tol debug failed 0.0% of the time.
sin
ULP max 0.5008857846260071 at x = 2.203355
ULP mean 0.24990503381476237
Test Summary: | Pass Total
Float32 sin | 1 1
Tol debug failed 0.0% of the time.
test_sine
ULP max 0.001272978144697845 at x = 2.899093
ULP mean 1.179825295005716e-8
Test Summary: | Pass Total
Float32 test_sine | 1 1
Tol debug failed 0.0% of the time.
test_sine
ULP max 0.001272978144697845 at x = 2.899093
ULP mean 1.179825295005716e-8
Test Summary: | Pass Total
Float32 test_sine | 1 1
Tol debug failed 0.0% of the time.
cos
ULP max 0.5008531212806702 at x = 0.45568538
ULP mean 0.2499933592458589
Test Summary: | Pass Total
Float32 cos | 1 1
Tol debug failed 0.0% of the time.
cos
ULP max 0.5008531212806702 at x = 0.45568538
ULP mean 0.2499933592458589
Test Summary: | Pass Total
Float32 cos | 1 1
Tol debug failed 0.0% of the time.
test_cosine
ULP max 0.0011584102176129818 at x = 1.4495481
ULP mean 1.6793535615395134e-8
Test Summary: | Pass Total
Float32 test_cosine | 1 1
Tol debug failed 0.0% of the time.
test_cosine
ULP max 0.0011584102176129818 at x = 1.4495481
ULP mean 1.6793535615395134e-8
Test Summary: | Pass Total
Float32 test_cosine | 1 1
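For reference, here is a minimal C transcription of the test_sine function above (my own sketch; the name pade_sinf is made up, the routine assumes a finite argument, and it has not been tested beyond matching the Julia code):
#include <math.h>
/* sin(x) via a Pade approximant of tan: halve the argument until it is small,
   approximate t = tan(z), undo the halvings with the tangent double-angle
   formula, then use sin(x) = 2*t/(1+t^2) with t = tan(x/2) */
float pade_sinf (float x)
{
    float z = 0.5f * x;   /* start at x/2 so the final step recovers sin(x) from tan(x/2) */
    int k = 0;
    while (fabsf (z) > 0.5f) {   /* reduce until |z| <= 0.5 */
        z *= 0.5f;
        k++;
    }
    float z2 = z * z;
    /* Pade approximant of tan(z): z*(945 - 105*z^2 + z^4) / (945 - 420*z^2 + 15*z^4) */
    float t = z * (1.0f + (z2 / 105.0f - 1.0f) * ((z / 3.0f) * (z / 3.0f))) /
                  (1.0f + (z2 / 7.0f - 4.0f) * ((z / 3.0f) * (z / 3.0f)));
    while (k > 0) {              /* tan(2a) = 2*tan(a) / (1 - tan(a)^2) */
        t = (2.0f * t) / (1.0f - t * t);
        k--;
    }
    return (2.0f * t) / (1.0f + t * t);
}
The corresponding cosine is obtained by replacing the last line with (1.0f - t * t) / (1.0f + t * t), as in test_cosine.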

Which exponentiation algorithms do CPU/programming languages use? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I've been learning about faster exponentiation algorithms (k-ary, sliding window, etc.), and was wondering which ones are used in CPUs/programming languages? (I'm fuzzy on whether this happens in the CPU or through the compiler.)
And just for kicks, which is the fastest?
Edit regarding the broadness: It's intentionally broad because I know there are a bunch of different techniques to do this. The checked answer had what I was looking for.
I assume your interest is in the implementation of the exponentiation functions that can be found in standard math libraries for HLLs, in particular C/C++. These include the functions exp(), exp2(), exp10(), and pow(), as well as their single-precision counterparts expf(), exp2f(), exp10f(), and powf().
The exponentiation methods you mention (such as k-ary and sliding window) are typically employed in cryptographic algorithms, such as RSA, which is based on exponentiation. They are not typically used for the exponentiation functions provided via math.h or cmath. The implementation details of standard math functions like exp() differ, but a common scheme follows a three-step process:
reduction of the function argument to a primary approximation interval
approximation of a suitable base function on the primary approximation interval
mapping back the result for the primary interval to the entire range of the function
An auxiliary step is often the handling of special cases. These can pertain to special mathematical situations such as log(0.0), or special floating-point operands such as NaN (Not a Number).
The C99 code for expf(float) below shows in exemplary fashion what those steps look like for a concrete example. The argument a is first split such that exp(a) = e^r * 2^i, where i is an integer and r is in [log(sqrt(0.5)), log(sqrt(2.0))], the primary approximation interval. In the second step, we now approximate e^r with a polynomial. Such approximations can be designed according to various design criteria such as minimizing absolute or relative error. The polynomial can be evaluated in various ways, including Horner's scheme and Estrin's scheme.
The code below uses a very common approach by employing a minimax approximation, which minimizes the maximum error over the entire approximation interval. A standard algorithm for computing such approximations is the Remez algorithm. Evaluation is via Horner's scheme; the numerical accuracy of this evaluation is enhanced by the use of fmaf().
The standard math function fmaf() implements what is known as a fused multiply-add, or FMA. This computes a*b+c using the full, unrounded product a*b during the addition and applies a single rounding at the end. On most modern hardware, such as GPUs, IBM Power CPUs, recent x86 processors (e.g. Haswell), and recent ARM processors (as an optional extension), this maps straight to a hardware instruction. On platforms that lack such an instruction, fmaf() will map to fairly slow emulation code, in which case we would not want to use it if we are interested in performance.
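As a small standalone illustration of that single rounding (my own example, not part of the answer's code): fmaf() can recover the rounding error that an ordinary multiply-then-subtract loses entirely:
#include <stdio.h>
#include <math.h>
int main (void)
{
    float a = 1.0f + 0x1.0p-12f;   /* 1 + 2^-12, exactly representable */
    float p = a * a;               /* rounded product: 1 + 2^-11 */
    float e_fma = fmaf (a, a, -p); /* exact product minus rounded product: 2^-24 */
    float e_mul = a * a - p;       /* without FMA the rounding error is lost: 0 */
    printf ("p = %a  error via FMA = %a  error without FMA = %a\n", p, e_fma, e_mul);
    return 0;
}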
The final computation is the multiplication by 2^i, for which C and C++ provide the function ldexp(). In "industrial strength" library code one typically uses a machine-specific idiom here that takes advantage of the use of IEEE-754 binary arithmetic for float. Lastly, the code cleans up cases of overflow and underflow.
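As an example of the kind of machine-specific idiom alluded to here (my own sketch, not the approach used in the code below, which simply calls ldexpf()): on an IEEE-754 binary32 platform one can scale by 2^i by adjusting the exponent field directly, provided the scaled result stays a normal number (no overflow, underflow, or subnormals):
#include <stdint.h>
#include <string.h>
/* scale a normal float by 2**i; assumes the result is also a normal number */
float scale_by_pow2 (float a, int i)
{
    uint32_t ia;
    memcpy (&ia, &a, sizeof ia);    /* reinterpret float as uint32 */
    ia += (uint32_t)i << 23;        /* add i to the biased exponent field */
    memcpy (&a, &ia, sizeof a);
    return a;
}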
The x87 FPU inside x86 processors has an instruction F2XM1 that computes 2^x - 1 on [-1,1]. This can be used for the second step of the computation of exp() and exp2(). There is an instruction FSCALE which is used to multiply by 2^i in the third step. A common way of implementing F2XM1 itself is as microcode that utilizes a rational or polynomial approximation. Note that the x87 FPU is maintained mostly for legacy support these days. On modern x86 platforms, libraries typically use pure software implementations based on SSE and algorithms similar to the one shown below. Some combine small tables with polynomial approximations.
pow(x,y) can be conceptually implemented as exp(y*log(x)), but this suffers from significant loss of accuracy when x is near unity and y is large in magnitude, as well as from incorrect handling of the numerous special cases specified in the C/C++ standards. One way to get around the accuracy issue is to compute log(x) and the product y*log(x) in some form of extended precision. The details would fill an entire, lengthy separate answer, and I do not have code handy to demonstrate it. In various C/C++ math libraries, pow(double,int) and powf(float, int) are computed by a separate code path that applies the "square-and-multiply" method with bit-wise scanning of the binary representation of the integer exponent.
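For the integer-exponent path mentioned last, a minimal sketch of such a square-and-multiply scheme might look as follows (my own illustration; a library implementation would add error compensation and special-case handling):
/* right-to-left binary exponentiation: scan the exponent bits from least to
   most significant, squaring the base at each step */
double pow_int (double x, int n)
{
    unsigned int e = (n < 0) ? (0u - (unsigned int)n) : (unsigned int)n;
    double r = 1.0;
    while (e) {
        if (e & 1) r *= x;   /* multiply in the contribution of the current bit */
        x *= x;              /* square for the next bit */
        e >>= 1;
    }
    return (n < 0) ? (1.0 / r) : r;
}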
#include <math.h> /* import fmaf(), ldexpf(), INFINITY */
/* Like rintf(), but -0.0f -> +0.0f, and |a| must be < 2**22 */
float quick_and_dirty_rintf (float a)
{
const float cvt_magic = 0x1.800000p+23f;
return (a + cvt_magic) - cvt_magic;
}
/* Approximate exp(a) on the interval [log(sqrt(0.5)), log(sqrt(2.0))]. */
float expf_poly (float a)
{
float r;
r = 0x1.694000p-10f; // 1.37805939e-3
r = fmaf (r, a, 0x1.125edcp-07f); // 8.37312452e-3
r = fmaf (r, a, 0x1.555b5ap-05f); // 4.16695364e-2
r = fmaf (r, a, 0x1.555450p-03f); // 1.66664720e-1
r = fmaf (r, a, 0x1.fffff6p-02f); // 4.99999851e-1
r = fmaf (r, a, 0x1.000000p+00f); // 1.00000000e+0
r = fmaf (r, a, 0x1.000000p+00f); // 1.00000000e+0
return r;
}
/* Approximate exp2() on interval [-0.5,+0.5] */
float exp2f_poly (float a)
{
float r;
r = 0x1.418000p-13f; // 1.53303146e-4
r = fmaf (r, a, 0x1.5efa94p-10f); // 1.33887795e-3
r = fmaf (r, a, 0x1.3b2c6cp-07f); // 9.61833261e-3
r = fmaf (r, a, 0x1.c6af8ep-05f); // 5.55036329e-2
r = fmaf (r, a, 0x1.ebfbe0p-03f); // 2.40226507e-1
r = fmaf (r, a, 0x1.62e430p-01f); // 6.93147182e-1
r = fmaf (r, a, 0x1.000000p+00f); // 1.00000000e+0
return r;
}
/* Approximate exp10(a) on [log(sqrt(0.5))/log(10), log(sqrt(2.0))/log(10)] */
float exp10f_poly (float a)
{
float r;
r = 0x1.a56000p-3f; // 0.20574951
r = fmaf (r, a, 0x1.155aa8p-1f); // 0.54170728
r = fmaf (r, a, 0x1.2bda96p+0f); // 1.17130411
r = fmaf (r, a, 0x1.046facp+1f); // 2.03465796
r = fmaf (r, a, 0x1.53524ap+1f); // 2.65094876
r = fmaf (r, a, 0x1.26bb1cp+1f); // 2.30258512
r = fmaf (r, a, 0x1.000000p+0f); // 1.00000000
return r;
}
/* Compute exponential base e. Maximum ulp error = 0.86565 */
float my_expf (float a)
{
float t, r;
int i;
t = a * 0x1.715476p+0f; // 1/log(2); 1.442695
t = quick_and_dirty_rintf (t);
i = (int)t;
r = fmaf (t, -0x1.62e400p-01f, a); // log_2_hi; -6.93145752e-1
r = fmaf (t, -0x1.7f7d1cp-20f, r); // log_2_lo; -1.42860677e-6
t = expf_poly (r);
r = ldexpf (t, i);
if (a < -105.0f) r = 0.0f;
if (a > 105.0f) r = INFINITY; // +INF
return r;
}
/* Compute exponential base 2. Maximum ulp error = 0.86770 */
float my_exp2f (float a)
{
float t, r;
int i;
t = quick_and_dirty_rintf (a);
i = (int)t;
r = a - t;
t = exp2f_poly (r);
r = ldexpf (t, i);
if (a < -152.0f) r = 0.0f;
if (a > 152.0f) r = INFINITY; // +INF
return r;
}
/* Compute exponential base 10. Maximum ulp error = 0.95588 */
float my_exp10f (float a)
{
float r, t;
int i;
t = a * 0x1.a934f0p+1f; // log2(10); 3.321928
t = quick_and_dirty_rintf (t);
i = (int)t;
r = fmaf (t, -0x1.344140p-2f, a); // log10(2)_hi // -3.01030159e-1
r = fmaf (t, 0x1.5ec10cp-23f, r); // log10(2)_lo // 1.63332601e-7
t = exp10f_poly (r);
r = ldexpf (t, i);
if (a < -46.0f) r = 0.0f;
if (a > 46.0f) r = INFINITY; // +INF
return r;
}
#include <string.h>
#include <stdint.h>
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof r);
return r;
}
uint64_t double_as_uint64 (double a)
{
uint64_t r;
memcpy (&r, &a, sizeof r);
return r;
}
double floatUlpErr (float res, double ref)
{
uint64_t i, j, err, refi;
int expoRef;
/* ulp error cannot be computed if either operand is NaN, infinity, zero */
if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
(res == 0.0f) || (ref == 0.0f)) {
return 0.0;
}
/* Convert the float result to an "extended float". This is like a float
with 56 instead of 24 effective mantissa bits.
*/
i = ((uint64_t)float_as_uint32(res)) << 32;
/* Convert the double reference to an "extended float". If the reference is
>= 2^129, we need to clamp to the maximum "extended float". If reference
is < 2^-126, we need to denormalize because of the float type's limited
exponent range.
*/
refi = double_as_uint64(ref);
expoRef = (int)(((refi >> 52) & 0x7ff) - 1023);
if (expoRef >= 129) {
j = 0x7fffffffffffffffULL;
} else if (expoRef < -126) {
j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
j = j >> (-(expoRef + 126));
} else {
j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
j = j | ((uint64_t)(expoRef + 127) << 55);
}
j = j | (refi & 0x8000000000000000ULL);
err = (i < j) ? (j - i) : (i - j);
return err / 4294967296.0;
}
#include <stdio.h>
#include <stdlib.h>
int main (void)
{
double ref, ulp, maxulp;
float arg, res, reff;
uint32_t argi, resi, refi, diff, sumdiff;
printf ("testing expf ...\n");
argi = 0;
sumdiff = 0;
maxulp = 0;
do {
arg = uint32_as_float (argi);
res = my_expf (arg);
ref = exp ((double)arg);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) maxulp = ulp;
reff = (float)ref;
refi = float_as_uint32 (reff);
resi = float_as_uint32 (res);
diff = (resi < refi) ? (refi - resi) : (resi - refi);
if (diff > 1) {
printf ("!! expf: arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
} else {
sumdiff += diff;
}
argi++;
} while (argi);
printf ("expf maxulp=%.5f sumdiff=%u\n", maxulp, sumdiff);
printf ("testing exp2f ...\n");
argi = 0;
maxulp = 0;
sumdiff = 0;
do {
arg = uint32_as_float (argi);
res = my_exp2f (arg);
ref = exp2 ((double)arg);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) maxulp = ulp;
reff = (float)ref;
refi = float_as_uint32 (reff);
resi = float_as_uint32 (res);
diff = (resi < refi) ? (refi - resi) : (resi - refi);
if (diff > 1) {
printf ("!! expf: arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
} else {
sumdiff += diff;
}
argi++;
} while (argi);
printf ("exp2f maxulp=%.5f sumdiff=%u\n", maxulp, sumdiff);
printf ("testing exp10f ...\n");
argi = 0;
maxulp = 0;
sumdiff = 0;
do {
arg = uint32_as_float (argi);
res = my_exp10f (arg);
ref = exp10 ((double)arg);
ulp = floatUlpErr (res, ref);
if (ulp > maxulp) maxulp = ulp;
reff = (float)ref;
refi = float_as_uint32 (reff);
resi = float_as_uint32 (res);
diff = (resi < refi) ? (refi - resi) : (resi - refi);
if (diff > 1) {
printf ("!! expf: arg=%08x res=%08x ref=%08x\n", argi, resi, refi);
return EXIT_FAILURE;
} else {
sumdiff += diff;
}
argi++;
} while (argi);
printf ("exp10f maxulp=%.5f sumdiff=%u\n", maxulp, sumdiff);
return EXIT_SUCCESS;
}

Inverse Error Function in C

Is it possible to calculate the inverse error function in C?
I can find erf(x) in <math.h> which calculates the error function, but I can't find anything to do the inverse.
At this time, the ISO C standard math library does not include erfinv(), or its single-precision variant erfinvf(). However, it is not too difficult to create one's own version, which I demonstrate below with an implementation of erfinvf() of reasonable accuracy and performance.
Looking at the graph of the inverse error function we observe that it is highly non-linear and is therefore difficult to approximate with a polynomial. One strategy to deal with this scenario is to "linearize" such a function by composing it from simpler elementary functions (which can themselves be computed with high performance and excellent accuracy) and a fairly linear function which is more easily amenable to polynomial approximations or rational approximations of low degree.
Here are some approaches to erfinv linearization known from the literature, all of which are based on logarithms. Typically, authors differentiate between a main, fairly linear portion of the inverse error function from zero to a switchover point very roughly around 0.9 and a tail portion from the switchover point to unity. In the following, log() denotes the natural logarithm, R() denotes a rational approximation, and P() denotes a polynomial approximation.
A. J. Strecok, "On the Calculation of the Inverse of the Error Function."
Mathematics of Computation, Vol. 22, No. 101 (Jan. 1968), pp. 144-158 (online)
β(x) = (-log(1-x^2))^(1/2); erfinv(x) = x · R(x^2) [main]; R(x) · β(x) [tail]
J. M. Blair, C. A. Edwards, J. H. Johnson, "Rational Chebyshev Approximations for the Inverse of the Error Function." Mathematics of Computation, Vol. 30, No. 136 (Oct. 1976), pp. 827-830 (online)
ξ = (-log(1-x))^(-1/2); erfinv(x) = x · R(x^2) [main]; ξ^(-1) · R(ξ) [tail]
M. Giles, "Approximating the erfinv function." In GPU Computing Gems Jade Edition, pp. 109-116. 2011. (online)
w = -log(1-x^2); s = √w; erfinv(x) = x · P(w) [main]; x · P(s) [tail]
The solution below generally follows the approach by Giles, but simplifies it in not requiring the square root for the tail portion, i.e. it uses two approximations of the type x · P(w). The code takes maximum advantage of the fused multiply-add operation FMA, which is exposed via the standard math functions fma() and fmaf() in C. Many common compute platforms, such as
IBM Power, Arm64, x86-64, and GPUs offer this operation in hardware. Where no hardware support exists, the use of fma{f}() will likely make the code below unacceptably slow as the operation needs to be emulated by the standard math library. Also, functionally incorrect emulations of FMA are known to exist.
The accuracy of the standard math library's logarithm function logf() will have some impact on the accuracy of my_erfinvf() below. As long as the library provides a faithfully-rounded implementation with error < 1 ulp, the stated error bound should hold, and it did for the few libraries I tried. For improved reproducibility, I have included my own portable faithfully-rounded implementation, my_logf().
#include <math.h>
float my_logf (float);
/* compute inverse error function with maximum error of 2.35793 ulp */
float my_erfinvf (float a)
{
float p, r, t;
t = fmaf (a, 0.0f - a, 1.0f);
t = my_logf (t);
if (fabsf(t) > 6.125f) { // maximum ulp error = 2.35793
p = 3.03697567e-10f; // 0x1.4deb44p-32
p = fmaf (p, t, 2.93243101e-8f); // 0x1.f7c9aep-26
p = fmaf (p, t, 1.22150334e-6f); // 0x1.47e512p-20
p = fmaf (p, t, 2.84108955e-5f); // 0x1.dca7dep-16
p = fmaf (p, t, 3.93552968e-4f); // 0x1.9cab92p-12
p = fmaf (p, t, 3.02698812e-3f); // 0x1.8cc0dep-9
p = fmaf (p, t, 4.83185798e-3f); // 0x1.3ca920p-8
p = fmaf (p, t, -2.64646143e-1f); // -0x1.0eff66p-2
p = fmaf (p, t, 8.40016484e-1f); // 0x1.ae16a4p-1
} else { // maximum ulp error = 2.35002
p = 5.43877832e-9f; // 0x1.75c000p-28
p = fmaf (p, t, 1.43285448e-7f); // 0x1.33b402p-23
p = fmaf (p, t, 1.22774793e-6f); // 0x1.499232p-20
p = fmaf (p, t, 1.12963626e-7f); // 0x1.e52cd2p-24
p = fmaf (p, t, -5.61530760e-5f); // -0x1.d70bd0p-15
p = fmaf (p, t, -1.47697632e-4f); // -0x1.35be90p-13
p = fmaf (p, t, 2.31468678e-3f); // 0x1.2f6400p-9
p = fmaf (p, t, 1.15392581e-2f); // 0x1.7a1e50p-7
p = fmaf (p, t, -2.32015476e-1f); // -0x1.db2aeep-3
p = fmaf (p, t, 8.86226892e-1f); // 0x1.c5bf88p-1
}
r = a * p;
return r;
}
/* compute natural logarithm with a maximum error of 0.85089 ulp */
float my_logf (float a)
{
float i, m, r, s, t;
int e;
m = frexpf (a, &e);
if (m < 0.666666667f) { // 0x1.555556p-1
m = m + m;
e = e - 1;
}
i = (float)e;
/* m in [2/3, 4/3] */
m = m - 1.0f;
s = m * m;
/* Compute log1p(m) for m in [-1/3, 1/3] */
r = -0.130310059f; // -0x1.0ae000p-3
t = 0.140869141f; // 0x1.208000p-3
r = fmaf (r, s, -0.121484190f); // -0x1.f19968p-4
t = fmaf (t, s, 0.139814854f); // 0x1.1e5740p-3
r = fmaf (r, s, -0.166846052f); // -0x1.55b362p-3
t = fmaf (t, s, 0.200120345f); // 0x1.99d8b2p-3
r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
r = fmaf (t, m, r);
r = fmaf (r, m, 0.333331972f); // 0x1.5554fap-2
r = fmaf (r, m, -0.500000000f); // -0x1.000000p-1
r = fmaf (r, s, m);
r = fmaf (i, 0.693147182f, r); // 0x1.62e430p-1 // log(2)
if (!((a > 0.0f) && (a <= 3.40282346e+38f))) { // 0x1.fffffep+127
r = a + a; // silence NaNs if necessary
if (a < 0.0f) r = ( 0.0f / 0.0f); // NaN
if (a == 0.0f) r = (-1.0f / 0.0f); // -Inf
}
return r;
}
Quick & dirty, tolerance under +-6e-3. Work based on "A handy approximation for the error function and its inverse" by Sergei Winitzki.
C/C++ CODE:
#include <math.h>
#define PI 3.14159265358979f
float myErfInv2(float x){
float tt1, tt2, lnx, sgn;
sgn = (x < 0) ? -1.0f : 1.0f;
x = (1 - x)*(1 + x); // x = 1 - x*x;
lnx = logf(x);
tt1 = 2/(PI*0.147) + 0.5f * lnx;
tt2 = 1/(0.147) * lnx;
return(sgn*sqrtf(-tt1 + sqrtf(tt1*tt1 - tt2)));
}
MATLAB sanity check:
clear all, close all, clc
x = linspace(-1, 1,10000);
% x = 1 - logspace(-8,-15,1000);
a = 0.15449436008930206298828125;
% a = 0.147;
u = log(1-x.^2);
u1 = 2/(pi*a) + u/2; u2 = u/a;
y = sign(x).*sqrt(-u1+sqrt(u1.^2 - u2));
f = erfinv(x); axis equal
figure(1);
plot(x, [y; f]); legend('Approx. erf(x)', 'erf(x)')
figure(2);
e = f-y;
plot(x, e);
MATLAB plots (not reproduced here): figure 1 overlays the approximation y with erfinv(x); figure 2 shows the error f - y.
I don't think there is a standard implementation in <math.h>, but there are other C math libraries that have implemented the inverse error function erfinv(x), which you can use.
Also quick and dirty: if less precision is acceptable, then I can share my own approximation using the inverse hyperbolic tangent; the parameters were found by Monte Carlo simulation, with all random values lying between 0.5 and 1.5:
p1 = 1.4872301551536515
p2 = 0.5739159012216655
p3 = 0.5803635928651558
( atanh( p^( 1 / p3 ) ) / p2 )^( 1 / p1 )
This comes from the algebraic reordering of my erf function approximation with the hyperbolic tangent, where the RMSE error is 0.000367354 for x between 1 and 4:
tanh( x^p1 * p2 )^p3
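Transcribed into C, the two formulas above might look like this (my own sketch; the function name erfinv_atanh_approx is made up, and per the stated fitting range it is only meaningful for arguments p = erf(x) with x roughly between 1 and 4, i.e. p close to 1):
#include <math.h>
/* inverse of erf(x) ~= tanh(x^p1 * p2)^p3, obtained by algebraic rearrangement */
double erfinv_atanh_approx (double p)
{
    const double p1 = 1.4872301551536515;
    const double p2 = 0.5739159012216655;
    const double p3 = 0.5803635928651558;
    return pow (atanh (pow (p, 1.0 / p3)) / p2, 1.0 / p1);
}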
I wrote another method that uses the fast-converging Newton-Raphson method, an iterative method for finding the root of a function: it starts with an initial guess and then iteratively improves the guess by using the derivative of the function. The Newton-Raphson method requires the function, its derivative, an initial guess and a stopping criterion.
In this case, the function whose root we are trying to find is f(y) = erf(y) - x, where x is the input value, and its derivative is 2.0 / sqrt(pi) * exp(-y^2). The initial guess is the input value x itself. The stopping criterion is a tolerance value, in this case 1.0e-16. Here is the code:
/*
============================================
Compile and execute with:
$ gcc inverf.c -o inverf -lm
$ ./inverf
============================================
*/
#include <stdio.h>
#include <math.h>
int main() {
double x, result, fx, dfx, dx, xold;
double tolerance = 1.0e-16;
double pi = 4.0 * atan(1.0);
int iteration;
// input value for x
printf("Calculator for inverse error function.\n");
printf("Enter the value for x: ");
scanf("%lf", &x);
// check the input value is between -1 and 1
if (x < -1.0 || x > 1.0) {
printf("Invalid input, x must be between -1 and 1.");
return 0;
}
// initial guess
result = x;
xold = 0.0;
iteration = 0;
// iterate until the solution converges
do {
xold = result;
fx = erf(result) - x;
dfx = 2.0 / sqrt(pi) * exp(-pow(result, 2.0));
dx = fx / dfx;
// update the solution
result = result - dx;
iteration = iteration + 1;
} while (fabs(result - xold) >= tolerance);
// output the result
printf("The inverse error function of %lf is %lf\n", x, result);
printf("Number of iterations: %d\n", iteration);
return 0;
}
In the terminal it should look something like this:
Calculator for inverse error function.
Enter the value for x: 0.5
The inverse error function of 0.500000 is 0.476936
Number of iterations: 5

Resources