Related
I'm trying to find ex without using math.h. My code gives wrong anwsers when x is bigger or lower than ~±20. I tried to change all double types to long double types, but it gave some trash on input.
My code is:
#include <stdio.h>
double fabs1(double x) {
if(x >= 0){
return x;
} else {
return x*(-1);
}
}
double powerex(double x) {
double a = 1.0, e = a;
for (int n = 1; fabs1(a) > 0.001; ++n) {
a = a * x / n;
e += a;
}
return e;
}
int main(){
freopen("input.txt", "r", stdin);
freopen("output.txt", "w", stdout);
int n;
scanf("%d", &n);
for(int i = 0; i<n; i++) {
double number;
scanf("%lf", &number);
double e = powerex(number);
printf("%0.15g\n", e);
}
return 0;
}
Input:
8
0.0
1.0
-1.0
2.0
-2.0
100.0
-100.0
0.189376476361643
My output:
1
2.71825396825397
0.367857142857143
7.38899470899471
0.135379188712522
2.68811714181613e+043
-2.91375564689153e+025
1.20849374134639
Right output:
1
2.71828182845905
0.367879441171442
7.38905609893065
0.135335283236613
2.68811714181614e+43
3.72007597602084e-44
1.20849583696666
You can see that my answer for e−100 is absolutely incorrect. Why does my code output this? What can I do to improve this algorithm?
When x is negative, the sign of each term alternates. This means each successive sum switches widely in value rather than increasing more gradually when a positive power is used. This means that the loss in precision with successive terms has a large effect on the result.
To handle this, check the sign of x at the start. If it is negative, switch the sign of x to perform the calculation, then when you reach the end of the loop invert the result.
Also, you can reduce the number of iterations by using the following counterintuitive condtion:
e != e + a
On its face, it appears that this should always be true. However, the condition becomes false when the value of a is outside of the precision of the value of e, in which case adding a to e doesn't change the value of e.
double powerex(double x) {
double a = 1.0, e = a;
int invert = x<0;
x = fabs1(x);
for (int n = 1; e != e + a ; ++n) {
a = a * x / n;
e += a;
}
return invert ? 1/e : e;
}
We can optimize a bit more to remove one loop iteration by initializing e with 0 instead of a, and calculating the next term at the bottom of the loop instead of the top:
double powerex(double x) {
double a = 1.0, e = 0;
int invert = x<0;
x = fabs1(x);
for (int n = 1; e != e + a ; ++n) {
e += a;
a = a * x / n;
}
return invert ? 1/e : e;
}
For values of x above one or so, you may consider to handle the integer part separately and compute powers of e by squarings. (E.g. e^9 = ((e²)²)².e takes 4 multiplies)
Indeed, the general term of the Taylor development, x^n/n! only starts to decrease after n>x (you multiply each time by x/k), so the summation takes at least x terms. On another hand, e^n can be computed in at most 2lg(n) multiplies, which is more efficient and more accurate.
So I would advise
to take the fractional part of x and use Taylor,
when the integer part is positive, multiply by e raised to that power,
when the integer part is zero, you are done,
when the integer part is negative, divide by e raised to that power.
You can even spare more by considering quarters: in the worst case (x=1), Taylor requires 18 terms before the last one becomes negligible. If you consider subtracting from x the immediately inferior multiple of 1/4 (and compensate multiplying by precomputed powers of e), the number of terms drops to 12.
E.g. e^0.8 = e^(3/4+0.05) = 2.1170000166126747 . e^0.05
I'm looking for an approximation of the natural exponential function operating on SSE element. Namely - __m128 exp( __m128 x ).
I have an implementation which is quick but seems to be very low in accuracy:
static inline __m128 FastExpSse(__m128 x)
{
__m128 a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
// fast exponential function, x should be in [-87, 87]
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
Could anybody have an implementation with better accuracy yet as fast (Or faster)?
I'd be happy if it is written in C Style.
Thank You.
The C code below is a translation into SSE intrinsics of an algorithm I used in a previous answer to a similar question.
The basic idea is to transform the computation of the standard exponential function into computation of a power of 2: expf (x) = exp2f (x / logf (2.0f)) = exp2f (x * 1.44269504). We split t = x * 1.44269504 into an integer i and a fraction f, such that t = i + f and 0 <= f <= 1. We can now compute 2f with a polynomial approximation, then scale the result by 2i by adding i to the exponent field of the single-precision floating-point result.
One problem that exists with an SSE implementation is that we want to compute i = floorf (t), but there is no fast way to compute the floor() function. However, we observe that for positive numbers, floor(x) == trunc(x), and that for negative numbers, floor(x) == trunc(x) - 1, except when x is a negative integer. However, since the core approximation can handle an f value of 1.0f, using the approximation for negative arguments is harmless. SSE provides an instruction to convert single-precision floating point operands to integers with truncation, so this solution is efficient.
Peter Cordes points out that SSE4.1 supports a fast floor function _mm_floor_ps(), so a variant using SSE4.1 is also shown below. Not all toolchains automatically predefine the macro __SSE4_1__ when SSE 4.1 code generation is enabled, but gcc does.
Compiler Explorer (Godbolt) shows that gcc 7.2 compiles the code below into sixteen instructions for plain SSE and twelve instructions for SSE 4.1.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <emmintrin.h>
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif
/* max. rel. error = 1.72863156e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
__m128 t, f, e, p, r;
__m128i i, j;
__m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */
__m128 c0 = _mm_set1_ps (0.3371894346f);
__m128 c1 = _mm_set1_ps (0.657636276f);
__m128 c2 = _mm_set1_ps (1.00172476f);
/* exp(x) = 2^i * 2^f; i = floor (log2(e) * x), 0 <= f <= 1 */
t = _mm_mul_ps (x, l2e); /* t = log2(e) * x */
#ifdef __SSE4_1__
e = _mm_floor_ps (t); /* floor(t) */
i = _mm_cvtps_epi32 (e); /* (int)floor(t) */
#else /* __SSE4_1__*/
i = _mm_cvttps_epi32 (t); /* i = (int)t */
j = _mm_srli_epi32 (_mm_castps_si128 (x), 31); /* signbit(t) */
i = _mm_sub_epi32 (i, j); /* (int)t - signbit(t) */
e = _mm_cvtepi32_ps (i); /* floor(t) ~= (int)t - signbit(t) */
#endif /* __SSE4_1__*/
f = _mm_sub_ps (t, e); /* f = t - floor(t) */
p = c0; /* c0 */
p = _mm_mul_ps (p, f); /* c0 * f */
p = _mm_add_ps (p, c1); /* c0 * f + c1 */
p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */
p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= 2^f */
j = _mm_slli_epi32 (i, 23); /* i << 23 */
r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
return r;
}
int main (void)
{
union {
float f[4];
unsigned int i[4];
} arg, res;
double relerr, maxrelerr = 0.0;
int i, j;
__m128 x, y;
float start[2] = {-0.0f, 0.0f};
float finish[2] = {-87.33654f, 88.72283f};
for (i = 0; i < 2; i++) {
arg.f[0] = start[i];
arg.i[1] = arg.i[0] + 1;
arg.i[2] = arg.i[0] + 2;
arg.i[3] = arg.i[0] + 3;
do {
memcpy (&x, &arg, sizeof(x));
y = fast_exp_sse (x);
memcpy (&res, &y, sizeof(y));
for (j = 0; j < 4; j++) {
double ref = exp ((double)arg.f[j]);
relerr = fabs ((res.f[j] - ref) / ref);
if (relerr > maxrelerr) {
printf ("arg=% 15.8e res=%15.8e ref=%15.8e err=%15.8e\n",
arg.f[j], res.f[j], ref, relerr);
maxrelerr = relerr;
}
}
arg.i[0] += 4;
arg.i[1] += 4;
arg.i[2] += 4;
arg.i[3] += 4;
} while (fabsf (arg.f[3]) < fabsf (finish[i]));
}
printf ("maximum relative errror = %15.8e\n", maxrelerr);
return EXIT_SUCCESS;
}
An alternative design for fast_sse_exp() extracts the integer portion of the adjusted argument x / log(2) in round-to-nearest mode, using the well-known technique of adding the "magic" conversion constant 1.5 * 223 to force rounding in the correct bit position, then subtracting out the same number again. This requires that the SSE rounding mode in effect during the addition is "round to nearest or even", which is the default. wim pointed out in comments that some compilers may optimize out the addition and subtraction of the conversion constant cvt as redundant when aggressive optimization is used, interfering with the functionality of this code sequence, so it is recommended to inspect the machine code generated. The approximation interval for computation of 2f is now centered around zero, since -0.5 <= f <= 0.5, requiring a different core approximation.
/* max. rel. error <= 1.72860465e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
__m128 t, f, p, r;
__m128i i, j;
const __m128 l2e = _mm_set1_ps (1.442695041f); /* log2(e) */
const __m128 cvt = _mm_set1_ps (12582912.0f); /* 1.5 * (1 << 23) */
const __m128 c0 = _mm_set1_ps (0.238428936f);
const __m128 c1 = _mm_set1_ps (0.703448006f);
const __m128 c2 = _mm_set1_ps (1.000443142f);
/* exp(x) = 2^i * 2^f; i = rint (log2(e) * x), -0.5 <= f <= 0.5 */
t = _mm_mul_ps (x, l2e); /* t = log2(e) * x */
r = _mm_sub_ps (_mm_add_ps (t, cvt), cvt); /* r = rint (t) */
f = _mm_sub_ps (t, r); /* f = t - rint (t) */
i = _mm_cvtps_epi32 (t); /* i = (int)t */
p = c0; /* c0 */
p = _mm_mul_ps (p, f); /* c0 * f */
p = _mm_add_ps (p, c1); /* c0 * f + c1 */
p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */
p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= exp2(f) */
j = _mm_slli_epi32 (i, 23); /* i << 23 */
r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
return r;
}
The algorithm for the code in the question appears to be taken from the work of Nicol N. Schraudolph, which cleverly exploits the semi-logarithmic nature of IEEE-754 binary floating-point formats:
N. N. Schraudolph. "A fast, compact approximation of the exponential function." Neural Computation, 11(4), May 1999, pp.853-862.
After removal of the argument clamping code, it reduces to just three SSE instructions. The "magical" correction constant 486411 is not optimal for minimizing maximum relative error over the entire input domain. Based on simple binary search, the value 298765 seems to be superior, reducing maximum relative error for FastExpSse() to 3.56e-2 vs. maximum relative error of 1.73e-3 for fast_exp_sse().
/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
__m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
__m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
__m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
return _mm_castsi128_ps (t);
}
Schraudolph's algorithm basically uses the linear approximation 2f ~= 1.0 + f for f in [0,1], and its accuracy could be improved by adding a quadratic term. The clever part of Schraudolph's approach is computing 2i * 2f without explicitly separating the integer portion i = floor(x * 1.44269504) from the fraction. I see no way to extend that trick to a quadratic approximation, but one can certainly combine the floor() computation from Schraudolph with the quadratic approximation used above:
/* max. rel. error <= 1.72886892e-3 on [-87.33654, 88.72283] */
__m128 fast_exp_sse (__m128 x)
{
__m128 f, p, r;
__m128i t, j;
const __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
const __m128i m = _mm_set1_epi32 (0xff800000); /* mask for integer bits */
const __m128 ttm23 = _mm_set1_ps (1.1920929e-7f); /* exp2(-23) */
const __m128 c0 = _mm_set1_ps (0.3371894346f);
const __m128 c1 = _mm_set1_ps (0.657636276f);
const __m128 c2 = _mm_set1_ps (1.00172476f);
t = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
j = _mm_and_si128 (t, m); /* j = (int)(floor (x/log(2))) << 23 */
t = _mm_sub_epi32 (t, j);
f = _mm_mul_ps (ttm23, _mm_cvtepi32_ps (t)); /* f = (x/log(2)) - floor (x/log(2)) */
p = c0; /* c0 */
p = _mm_mul_ps (p, f); /* c0 * f */
p = _mm_add_ps (p, c1); /* c0 * f + c1 */
p = _mm_mul_ps (p, f); /* (c0 * f + c1) * f */
p = _mm_add_ps (p, c2); /* p = (c0 * f + c1) * f + c2 ~= 2^f */
r = _mm_castsi128_ps (_mm_add_epi32 (j, _mm_castps_si128 (p))); /* r = p * 2^i*/
return r;
}
A good increase in accuracy in my algorithm (implementation FastExpSse in the answer above) can be obtained at the cost of an integer subtraction and floating-point division by using FastExpSse(x/2)/FastExpSse(-x/2) instead of FastExpSse(x). The trick here is to set the shift parameter (298765 above) to zero so that the piecewise linear approximations in the numerator and denominator line up to give you substantial error cancellation. Roll it into a single function:
__m128 BetterFastExpSse (__m128 x)
{
const __m128 a = _mm_set1_ps ((1 << 22) / float(M_LN2)); // to get exp(x/2)
const __m128i b = _mm_set1_epi32 (127 * (1 << 23)); // NB: zero shift!
__m128i r = _mm_cvtps_epi32 (_mm_mul_ps (a, x));
__m128i s = _mm_add_epi32 (b, r);
__m128i t = _mm_sub_epi32 (b, r);
return _mm_div_ps (_mm_castsi128_ps (s), _mm_castsi128_ps (t));
}
(I'm not a hardware guy - how bad a performance killer is the division here?)
If you need exp(x) just to get y = tanh(x) (e.g. for neural networks), use FastExpSse with zero shift as follows:
a = FastExpSse(x);
b = FastExpSse(-x);
y = (a - b)/(a + b);
to get the same type of error cancellation benefit. The logistic function works similarly, using FastExpSse(x/2)/(FastExpSse(x/2) + FastExpSse(-x/2)) with zero shift. (This is just to show the principle - you obviously don't want to evaluate FastExpSse multiple times here, but roll it into a single function along the lines of BetterFastExpSse above.)
I did develop a series of higher-order approximations from this, ever more accurate but also slower. Unpublished but happy to collaborate if anyone wants to give them a spin.
And finally, for some fun: use in reverse gear to get FastLogSse. Chaining that with FastExpSse gives you both operator and error cancellation, and out pops a blazingly fast power function...
Going back through my notes from way back then, I did explore ways to improve the accuracy without using division. I used the same reinterpret-as-float trick but applied a polynomial correction to the mantissa which was essentially calculated in 16-bit fixed-point arithmetic (the only way to do it fast back then).
The cubic resp. quartic versions give you 4 resp. 5 significant digits of accuracy. There was no point increasing the order beyond that, as the noise of the low-precision arithmetic then starts to drown out the error of the polynomial approximation. Here are the plain C versions:
#include <stdint.h>
float fastExp3(register float x) // cubic spline approximation
{
union { float f; int32_t i; } reinterpreter;
reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
int32_t m = (reinterpreter.i >> 7) & 0xFFFF; // copy mantissa
// empirical values for small maximum relative error (8.34e-5):
reinterpreter.i +=
((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
return reinterpreter.f;
}
float fastExp4(register float x) // quartic spline approximation
{
union { float f; int32_t i; } reinterpreter;
reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
int32_t m = (reinterpreter.i >> 7) & 0xFFFF; // copy mantissa
// empirical values for small maximum relative error (1.21e-5):
reinterpreter.i += (((((((((((3537*m) >> 16)
+ 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
return reinterpreter.f;
}
The quartic one obeys (fastExp4(0f) == 1f), which can be important for fixed-point iteration algorithms.
How efficient are these integer multiply-shift-add sequences in SSE? On architectures where float arithmetic is just as fast, one could use that instead, reducing the arithmetic noise. This would essentially yield cubic and quartic extensions of #njuffa's answer above.
There is a paper about creating fast versions of these equations (tanh, cosh, artanh, sinh, etc):
http://ijeais.org/wp-content/uploads/2018/07/IJAER180702.pdf
"Creating a Compiler Optimized Inlineable Implementation of Intel Svml Simd Intrinsics"
their tanh equation 6, on page 9 is very similar to #NicSchraudolph answer
For softmax use, I'm envisioning the flow as:
auto a = _mm_mul_ps(x, _mm_set1_ps(12102203.2f));
auto b = _mm_castsi128_ps(_mm_cvtps_epi32(a)); // so far as in other variants
// copy 9 MSB from 0x3f800000 over 'b' so that 1 <= c < 2
// - also 1 <= poly_eval(...) < 2
auto c = replace_exponent(b, _mm_set1_ps(1.0f));
auto d = poly_eval(c, kA, kB, kC); // 2nd degree polynomial
auto e = replace_exponent(d, b); // restore exponent : 2^i * 2^f
The exponent copying can be done as bitwise select using a proper mask (AVX-512 has vpternlogd, and I'm using actually Arm Neon vbsl).
All the input values x must be negative and clamped between -17-f(N) <= x <= -f(N), so that when scaled by (1<<23)/log(2), the maximum sum of the N resulting floating point values do not reach infinity and that the reciprocal does not become denormal. For N=3, f(N) = 4. Larger f(N) will trade off input precision.
The polyeval coefficients are generated for example by polyfit([1 1.5 2],[1 sqrt(2) 2]), with kA=0.343146, kB=-0.029437, kC=0.68292, producing strictly values smaller than 2 and preventing discontinuities. The maximum average error can be diminished by evaluating the polynomial at x=[1+max_err 1.5-eps 2], y=[1 2^(.5-eps) 2-max_err].
For strictly SSE/AVX, exponent replacement for 1.0f can be done by (x & 0x007fffff) | 0x3f800000). A two instruction sequence for the latter exponent replacement can be found by ensuring that poly_eval(x) evaluates to a range, which can be directly ored with b & 0xff800000.
I have developed for my purposes the following function that calculates quickly and accurately the natural exponent with single precision. The function works in the entire range of float values. The code is written under Visual Studio (x86). AVX is used instead of SSE, but that shouldn't be a problem. The accuracy of this function is almost the same as standard expf function, but significantly faster. Used approximation is based on the Chebyshev series expansion of the function f(t)=t/(2^(t/2)-1)+t/2 for t from the [-1; 1]. I thank Peter Cordes for his good advice.
_declspec(naked) float _vectorcall fexp(float x)
{
static const float ct[7] = // Constants table
{
1.44269502f, // lb(e)
1.92596299E-8f, // Correction to the value lb(e)
-9.21120925E-4f, // 16*b2
0.115524396f, // 4*b1
2.88539004f, // b0
2.0f, // 2
4.65661287E-10f // 2^-31
};
_asm
{
mov ecx,offset ct // ecx contains the address of constants tables
vmulss xmm1,xmm0,[ecx] // xmm1 = x*lb(e)
vcvtss2si eax,xmm1 // eax = round(x*lb(e)) = k
cdq // edx=-1, if x<0 or overflow, otherwise edx=0
vmovss xmm3,[ecx+8] // Initialize the sum with highest coefficient 16*b2
and edx,4 // edx=4, if x<0 or overflow, otherwise edx=0
vcvtsi2ss xmm1,xmm1,eax // xmm1 = k
lea eax,[eax+8*edx] // Add 32 to exponent, if x<0
vfmsub231ss xmm1,xmm0,[ecx] // xmm1 = x*lb(e)-k = t/2 in the range from -0,5 to 0,5
add eax,126 // The exponent of 2^(k-1) or 2^(k+31) with bias 127
jle exp_low // Jump if x<<0 or overflow (|x| too large or x=NaN)
vfmadd132ss xmm0,xmm1,[ecx+4] // xmm0 = t/2 (corrected value)
cmp eax,254 // Check that the exponent is not too large
jg exp_inf // Jump to set Inf if overflow
vmulss xmm2,xmm0,xmm0 // xmm2 = t^2/4 - the argument of the polynomial
shl eax,23 // The bits of the float value 2^(k-1) or 2^(k+31)
vfmadd213ss xmm3,xmm2,[ecx+12] // xmm3 = 4*b1+4*b2*t^2
vmovd xmm1,eax // xmm1 = 2^(k-1) или 2^(k+31)
vfmsub213ss xmm3,xmm2,xmm0 // xmm3 = -t/2+b1*t^2+b2*t^4
vaddss xmm0,xmm0,xmm0 // xmm0 = t
vaddss xmm3,xmm3,[ecx+16] // xmm3 = b0-t/2+b1*t^2+b2*t^4 = f(t)-t/2
vdivss xmm0,xmm0,xmm3 // xmm0 = t/(f(t)-t/2)
vfmadd213ss xmm0,xmm1,xmm1 // xmm0 = e^x with shifted exponent of -1 or 31
vmulss xmm0,xmm0,[ecx+edx+20] // xmm0 = e^x
ret // Return
exp_low: // Handling the case of x<<0 or overflow
vucomiss xmm0,[ecx] // Check the sign of x and a condition x=NaN
jp exp_end // Complete with NaN result, if x=NaN
exp_inf: // Entry point for processing large x
vxorps xmm0,xmm0,xmm0 // xmm0 = 0
jc exp_end // Ready, if x<<0
vrcpss xmm0,xmm0,xmm0 // xmm0 = Inf in case x>>0
exp_end: // The result at xmm0 is ready
ret // Return
}
}
Below I post a simplified algorithm. Support for denormalized numbers in the result is removed here.
_declspec(naked) float _vectorcall fexp(float x)
{
static const float ct[5] = // Constants table
{
1.44269502f, // lb(e)
1.92596299E-8f, // Correction to the value lb(e)
-9.21120925E-4f, // 16*b2
0.115524396f, // 4*b1
2.88539004f // b0
};
_asm
{
mov edx,offset ct // edx contains the address of constants tables
vmulss xmm1,xmm0,[edx] // xmm1 = x*lb(e)
vcvtss2si eax,xmm1 // eax = round(x*lb(e)) = k
vmovss xmm3,[edx+8] // Initialize the sum with highest coefficient 16*b2
vcvtsi2ss xmm1,xmm1,eax // xmm1 = k
cmp eax,127 // Check that the exponent is not too large
jg exp_break // Jump to set Inf if overflow
vfmsub231ss xmm1,xmm0,[edx] // xmm1 = x*lb(e)-k = t/2 in the range from -0,5 to 0,5
add eax,127 // Receive the exponent of 2^k with the bias 127
jle exp_break // The result is 0, if x<<0
vfmadd132ss xmm0,xmm1,[edx+4] // xmm0 = t/2 (corrected value)
vmulss xmm2,xmm0,xmm0 // xmm2 = t^2/4 - the argument of polynomial
shl eax,23 // eax contains the bits of 2^k
vfmadd213ss xmm3,xmm2,[edx+12] // xmm3 = 4*b1+4*b2*t^2
vmovd xmm1,eax // xmm1 = 2^k
vfmsub213ss xmm3,xmm2,xmm0 // xmm3 = -t/2+b1*t^2+b2*t^4
vaddss xmm0,xmm0,xmm0 // xmm0 = t
vaddss xmm3,xmm3,[edx+16] // xmm3 = b0-t/2+b1*t^2+b2*t^4 = f(t)-t/2
vdivss xmm0,xmm0,xmm3 // xmm0 = t/(f(t)-t/2)
vfmadd213ss xmm0,xmm1,xmm1 // xmm0 = 2^k*(t/(f(t)-t/2)+1) = e^x
ret // Return
exp_break: // Get 0 for x<0 or Inf for x>>0
vucomiss xmm0,[edx] // Check the sign of x and a condition x=NaN
jp exp_end // Complete with NaN result, if x=NaN
vxorps xmm0,xmm0,xmm0 // xmm0 = 0
jc exp_end // Ready, if x<<0
vrcpss xmm0,xmm0,xmm0 // xmm0 = Inf, if x>>0
exp_end: // The result at xmm0 is ready
ret // Return
}
}
Is it possible to calculate the inverse error function in C?
I can find erf(x) in <math.h> which calculates the error function, but I can't find anything to do the inverse.
At this time, the ISO C standard math library does not include erfinv(), or its single-precision variant erfinvf(). However, it is not too difficult to create one's own version, which I demonstrate below with an implementation of erfinvf() of reasonable accuracy and performance.
Looking at the graph of the inverse error function we observe that it is highly non-linear and is therefore difficult to approximate with a polynomial. One strategy do deal with this scenario is to "linearize" such a function by compositing it from simpler elementary functions (which can themselves be computed at high performance and excellent accuracy) and a fairly linear function which is more easily amenable to polynomial approximations or rational approximations of low degree.
Here are some approaches to erfinv linearization known from the literature, all of which are based on logarithms. Typically, authors differentiate between a main, fairly linear portion of the inverse error function from zero to a switchover point very roughly around 0.9 and a tail portion from the switchover point to unity. In the following, log() denotes the natural logarithm, R() denotes a rational approximation, and P() denotes a polynomial approximation.
A. J. Strecok, "On the Calculation of the Inverse of the Error Function."
Mathematics of Computation, Vol. 22, No. 101 (Jan. 1968), pp. 144-158 (online)
β(x) = (-log(1-x2]))½; erfinv(x) = x · R(x2) [main]; R(x) · β(x) [tail]
J. M. Blair, C. A. Edwards, J. H. Johnson, "Rational Chebyshev Approximations for the Inverse of the Error Function." Mathematics of Computation, Vol. 30, No. 136 (Oct. 1976), pp. 827-830 (online)
ξ = (-log(1-x))-½; erfinv(x) = x · R(x2) [main]; ξ-1 · R(ξ) [tail]
M. Giles, "Approximating the erfinv function." In GPU Computing Gems Jade Edition, pp. 109-116. 2011. (online)
w = -log(1-x2); s = √w; erfinv(x) = x · P(w) [main]; x · P(s) [tail]
The solution below generally follows the approach by Giles, but simplifies it in not requiring the square root for the tail portion, i.e. it uses two approximations of the type x · P(w). The code takes maximum advantage of the fused multiply-add operation FMA, which is exposed via the standard math functions fma() and fmaf() in C. Many common compute platforms, such as
IBM Power, Arm64, x86-64, and GPUs offer this operation in hardware. Where no hardware support exists, the use of fma{f}() will likely make the code below unacceptably slow as the operation needs to be emulated by the standard math library. Also, functionally incorrect emulations of FMA are known to exist.
The accuracy of the standard math library's logarithm function logf() will have some impact on the accuracy of my_erfinvf() below. As long as the library provides a faithfully-rounded implementation with error < 1 ulp, the stated error bound should hold and it did for the few libraries I tried. For improved reproducability, I have included my own portable faithfully-rounded implementation, my_logf().
#include <math.h>
float my_logf (float);
/* compute inverse error functions with maximum error of 2.35793 ulp */
float my_erfinvf (float a)
{
float p, r, t;
t = fmaf (a, 0.0f - a, 1.0f);
t = my_logf (t);
if (fabsf(t) > 6.125f) { // maximum ulp error = 2.35793
p = 3.03697567e-10f; // 0x1.4deb44p-32
p = fmaf (p, t, 2.93243101e-8f); // 0x1.f7c9aep-26
p = fmaf (p, t, 1.22150334e-6f); // 0x1.47e512p-20
p = fmaf (p, t, 2.84108955e-5f); // 0x1.dca7dep-16
p = fmaf (p, t, 3.93552968e-4f); // 0x1.9cab92p-12
p = fmaf (p, t, 3.02698812e-3f); // 0x1.8cc0dep-9
p = fmaf (p, t, 4.83185798e-3f); // 0x1.3ca920p-8
p = fmaf (p, t, -2.64646143e-1f); // -0x1.0eff66p-2
p = fmaf (p, t, 8.40016484e-1f); // 0x1.ae16a4p-1
} else { // maximum ulp error = 2.35002
p = 5.43877832e-9f; // 0x1.75c000p-28
p = fmaf (p, t, 1.43285448e-7f); // 0x1.33b402p-23
p = fmaf (p, t, 1.22774793e-6f); // 0x1.499232p-20
p = fmaf (p, t, 1.12963626e-7f); // 0x1.e52cd2p-24
p = fmaf (p, t, -5.61530760e-5f); // -0x1.d70bd0p-15
p = fmaf (p, t, -1.47697632e-4f); // -0x1.35be90p-13
p = fmaf (p, t, 2.31468678e-3f); // 0x1.2f6400p-9
p = fmaf (p, t, 1.15392581e-2f); // 0x1.7a1e50p-7
p = fmaf (p, t, -2.32015476e-1f); // -0x1.db2aeep-3
p = fmaf (p, t, 8.86226892e-1f); // 0x1.c5bf88p-1
}
r = a * p;
return r;
}
/* compute natural logarithm with a maximum error of 0.85089 ulp */
float my_logf (float a)
{
float i, m, r, s, t;
int e;
m = frexpf (a, &e);
if (m < 0.666666667f) { // 0x1.555556p-1
m = m + m;
e = e - 1;
}
i = (float)e;
/* m in [2/3, 4/3] */
m = m - 1.0f;
s = m * m;
/* Compute log1p(m) for m in [-1/3, 1/3] */
r = -0.130310059f; // -0x1.0ae000p-3
t = 0.140869141f; // 0x1.208000p-3
r = fmaf (r, s, -0.121484190f); // -0x1.f19968p-4
t = fmaf (t, s, 0.139814854f); // 0x1.1e5740p-3
r = fmaf (r, s, -0.166846052f); // -0x1.55b362p-3
t = fmaf (t, s, 0.200120345f); // 0x1.99d8b2p-3
r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
r = fmaf (t, m, r);
r = fmaf (r, m, 0.333331972f); // 0x1.5554fap-2
r = fmaf (r, m, -0.500000000f); // -0x1.000000p-1
r = fmaf (r, s, m);
r = fmaf (i, 0.693147182f, r); // 0x1.62e430p-1 // log(2)
if (!((a > 0.0f) && (a <= 3.40282346e+38f))) { // 0x1.fffffep+127
r = a + a; // silence NaNs if necessary
if (a < 0.0f) r = ( 0.0f / 0.0f); // NaN
if (a == 0.0f) r = (-1.0f / 0.0f); // -Inf
}
return r;
}
Quick & dirty, tolerance under +-6e-3. Work based on "A handy approximation for the error function and its inverse" by Sergei Winitzki.
C/C++ CODE:
float myErfInv2(float x){
float tt1, tt2, lnx, sgn;
sgn = (x < 0) ? -1.0f : 1.0f;
x = (1 - x)*(1 + x); // x = 1 - x*x;
lnx = logf(x);
tt1 = 2/(PI*0.147) + 0.5f * lnx;
tt2 = 1/(0.147) * lnx;
return(sgn*sqrtf(-tt1 + sqrtf(tt1*tt1 - tt2)));
}
MATLAB sanity check:
clear all, close all, clc
x = linspace(-1, 1,10000);
% x = 1 - logspace(-8,-15,1000);
a = 0.15449436008930206298828125;
% a = 0.147;
u = log(1-x.^2);
u1 = 2/(pi*a) + u/2; u2 = u/a;
y = sign(x).*sqrt(-u1+sqrt(u1.^2 - u2));
f = erfinv(x); axis equal
figure(1);
plot(x, [y; f]); legend('Approx. erf(x)', 'erf(x)')
figure(2);
e = f-y;
plot(x, e);
MATLAB Plots:
I don't think it's a standard implementation in <math.h>, but there are other C math libraries that have implement the inverse error function erfinv(x), that you can use.
Also quick and dirty: if less precision is allowed than I share my own approximation with the inverse hyperbolic tangent - the parameters are sought by monte carle simulation where all random values are between the range of 0.5 and 1.5:
p1 = 1.4872301551536515
p2 = 0.5739159012216655
p3 = 0.5803635928651558
atanh( p^( 1 / p3 ) ) / p2 )^( 1 / p1 )
This comes from the algebraic reordering of my erf function approximation with the hyperbolic tangent, where the RMSE error is 0.000367354 for x between 1 and 4:
tanh( x^p1 * p2 )^p3
I wrote another method that uses the fast-converging Newton-Rhapson method, which is an iterative method to find the root of a function. It starts with an initial guess and then iteratively improves the guess by using the derivative of the function. The Newton-Raphson method requires the function, its derivative, an initial guess and a stopping criteria.
In this case, the function we are trying to find the root of is erf(x) - x. And the derivative of this function is 2.0 / sqrt(pi) * exp(-x**2). The initial guess is the input value for x. The stopping criteria is a tolerance value, in this case it's 1.0e-16. Here is the code:
/*
============================================
Compile and execute with:
$ gcc inverf.c -o inverf -lm
$ ./inverf
============================================
*/
#include <stdio.h>
#include <math.h>
int main() {
double x, result, fx, dfx, dx, xold;
double tolerance = 1.0e-16;
double pi = 4.0 * atan(1.0);
int iteration, i;
// input value for x
printf("Calculator for inverse error function.\n");
printf("Enter the value for x: ");
scanf("%lf", &x);
// check the input value is between -1 and 1
if (x < -1.0 || x > 1.0) {
printf("Invalid input, x must be between -1 and 1.");
return 0;
}
// initial guess
result = x;
xold = 0.0;
iteration = 0;
// iterate until the solution converges
do {
xold = result;
fx = erf(result) - x;
dfx = 2.0 / sqrt(pi) * exp(-pow(result, 2.0));
dx = fx / dfx;
// update the solution
result = result - dx;
iteration = iteration + 1;
} while (fabs(result - xold) >= tolerance);
// output the result
printf("The inverse error function of %lf is %lf\n", x, result);
printf("Number of iterations: %d\n", iteration);
return 0;
}
In the terminal it should look something like this:
Calculator for inverse error function.
Enter the value for x: 0.5
The inverse error function of 0.500000 is 0.476936
Number of iterations: 5
I need to find the smallest number of steps it takes to get between two points in a grid. If you are positioned at the center, and you can only move in 8 directions to the integer points surrounding you, then what's the least number of steps to take to get to a destination point?
I have a solution for this, but it is massively ugly and I'm honestly a bit ashamed of it:
/**
* #details Each point in the graph is a linear combination of these vectors:
*
* [x] = a[0] + b[1] + c[1] + d[ 1]
* [y] [1] [1] [0] [-1]
*
* EQ1: c + b + d == x
* EQ2: a + b - d == y
*
* Any path can be simplified to involve at most two of these variables, so we
* can solve the linear equations above with the knowledge that at least two of
* a, b, c, and d are 0. The sum of the absolute value of the coefficients is
* the number of steps taken in the grid.
*/
unsigned min_distance(point_t start, point_t goal)
{
int a, b, c, d;
int swap, steps;
int x, y;
x = goal.x - start.x;
y = goal.y - start.y;
/* Possible simple shortcuts */
if (x == 0 || y == 0) {
steps = abs(x) + abs(y);
} else if (abs(x) == abs(y)) {
steps = abs(y);
} else {
b = x, a = y - b;
steps = abs(a) + abs(b);
c = x, a = y;
swap = abs(a) + abs(c);
if (steps > swap)
steps = swap;
d = x, a = y + d;
swap = abs(a) + abs(d);
if (steps > swap)
steps = swap;
b = y, c = x - b;
swap = abs(b) + abs(c);
if (steps > swap)
steps = swap;
b = (x + y) / 2, d = b - y;
swap = abs(b) + abs(d);
if ((x + y) % 2 == 0 && steps > swap)
steps = swap;
d = -y, c = x - d;
swap = abs(c) + abs(d);
if (steps > swap)
steps = swap;
}
return steps;
}
The comment at the top explains the actual algorithm: represent each valid step as a column vector in a matrix, then find the smallest solution to the resulting system of linear equations.
In this case I saw the best answer would use at most two variables, and solved the equation six times by setting different combinations of the variables to 0. That's too specific! I want to be able to change the rules about which steps are valid and still be able to find the min distance.
EDIT: I realize this is a very poor simple example of what I'm trying to do, because of how easy it is to simplify the problem in this case. The goal is to calculate the number of steps given arbitrary stepping rules. If it the allowed steps instead looked like this then I'd start with a different matrix ([0 2 3 4 x; 1 2 0 -4 y]) and find the least solution to a different system of equations (2b + 3c + 4d = x, a + 2b - 4d = y). I'm actually trying to write a procedure that can work with any set of vectors to find the minimum number of steps.
...Any advice or criticism?
alright. I have the Euclidean division like this : a = b * q + r
I know that to get r, I can do the modulo : a % b
but how do I get q ? // doesn't seem to work.
Using Euclidean division
If a = 7 and b = 3, then q = 2 and r = 1, since 7 = 3 × 2 + 1.
If a = 7 and b = −3, then q = −2 and r = 1, since 7 = −3 × (−2) + 1.
If a = −7 and b = 3, then q = −3 and r = 2, since −7 = 3 × (−3) + 2.
If a = −7 and b = −3, then q = 3 and r = 2, since −7 = −3 × 3 + 2.
Likely a more simple solution is available.
int Ediv(int a, int b) {
printf("a:%2d / b:%2d = ", a,b);
int r = a % b;
if (r < 0) r += abs(b);
printf("r:%2d ", r);
return (a - r) / b;
}
void Etest() {
printf("q:%2d\n", Ediv(7,3));
printf("q:%2d\n", Ediv(7,-3));
printf("q:%2d\n", Ediv(-7,3));
printf("q:%2d\n", Ediv(-7,-3));
}
a: 7 / b: 3 = r: 1 q: 2
a: 7 / b:-3 = r: 1 q:-2
a:-7 / b: 3 = r: 2 q:-3
a:-7 / b:-3 = r: 2 q: 3
OP asserts "I know that to get r, I can do the modulo : a % b". This fails when a is negative.
Further, % is the "remainder operator". In C, the difference between Euclidean remainder and modulo occurs when a is negative. Remainder calculation for the modulo operation
If a and b are integers, just use integer division, /.
chux's answer is a good one, as it's the one that correctly understands the question (unlike the accepted one). Daan Leijen's Division and Modulus for Computer Scientists also provides an algorithm with proof:
/* Euclidean quotient */
long quoE(long numer, long denom) {
/* The C99 and C++11 languages define both of these as truncating. */
long q = numer / denom;
long r = numer % denom;
if (r < 0) {
if (denom > 0)
q = q - 1; // r = r + denom;
else {
q = q + 1; // r = r - denom;
}
return q;
}
This one is trivially equivalent to chux's algorithm. It looks a bit more verbose, but it might take one fewer div instruction on x86, as the instruction performs a "divmod" -- returning q and r at the same time.
Take an example:
7 = 3*2 + 1
2 (q) can be obtained by 7/3, i.e, a/b = q.