In SSE there is a function _mm_cvtepi32_ps(__m128i input) which takes input vector of 32 bits wide signed integers (int32_t) and converts them into floats.
Now, I want to interpret input integers as not signed. But there is no function _mm_cvtepu32_ps and I could not find an implementation of one. Do you know where I can find such a function or at least give a hint on the implementation?
To illustrate the the difference in results:
unsigned int a = 2480160505; // 10010011 11010100 00111110 11111001
float a1 = a; // 01001111 00010011 11010100 00111111;
float a2 = (signed int)a; // 11001110 11011000 01010111 10000010
With Paul R's solution and with my previous solution
the difference between the rounded floating point and the original integer is less than or equal to
0.75 ULP (Unit in the Last Place). In these methods
at two places rounding may occur: in _mm_cvtepi32_ps and
in _mm_add_ps. This leads to results that are not as accurate as possible for some inputs.
For example, with Paul R's method 0x2000003=33554435 is converted to 33554432.0, but 33554436.0
also exists as a float, which would have been better here.
My previous solution suffers from similar inaccuracies.
Such inaccurate results may also occur with compiler generated code, see here.
Following the approach of gcc (see Peter Cordes' answer to that other SO question), an accurate conversion within 0.5 ULP is obtained:
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
__m128i msk_lo = _mm_set1_epi32(0xFFFF);
__m128 cnst65536f= _mm_set1_ps(65536.0f);
__m128i v_lo = _mm_and_si128(v,msk_lo); /* extract the 16 lowest significant bits of v */
__m128i v_hi = _mm_srli_epi32(v,16); /* 16 most significant bits of v */
__m128 v_lo_flt = _mm_cvtepi32_ps(v_lo); /* No rounding */
__m128 v_hi_flt = _mm_cvtepi32_ps(v_hi); /* No rounding */
v_hi_flt = _mm_mul_ps(cnst65536f,v_hi_flt); /* No rounding */
return _mm_add_ps(v_hi_flt,v_lo_flt); /* Rounding may occur here, mul and add may fuse to fma for haswell and newer */
} /* _mm_add_ps is guaranteed to give results with an error of at most 0.5 ULP */
Note that other high bits/low bits partitions are possible as long as _mm_cvt_ps can convert
both pieces to floats without rounding.
For example, a partition with 20 high bits and 12 low bits will work equally well.
This functionality exists in AVX-512, but if you can't wait until then the only thing I can suggest is to convert the unsigned int input values into pairs of smaller values, convert these, and then add them together again, e.g.
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
__m128i v2 = _mm_srli_epi32(v, 1); // v2 = v / 2
__m128i v1 = _mm_sub_epi32(v, v2); // v1 = v - (v / 2)
__m128 v2f = _mm_cvtepi32_ps(v2);
__m128 v1f = _mm_cvtepi32_ps(v1);
return _mm_add_ps(v2f, v1f);
}
UPDATE
As noted by #wim in his answer, the above solution fails for an input value of UINT_MAX. Here is a more robust, but slightly less efficient solution, which should work for the full uint32_t input range:
inline __m128 _mm_cvtepu32_ps(const __m128i v)
{
__m128i v2 = _mm_srli_epi32(v, 1); // v2 = v / 2
__m128i v1 = _mm_and_si128(v, _mm_set1_epi32(1)); // v1 = v & 1
__m128 v2f = _mm_cvtepi32_ps(v2);
__m128 v1f = _mm_cvtepi32_ps(v1);
return _mm_add_ps(_mm_add_ps(v2f, v2f), v1f); // return 2 * v2 + v1
}
I think Paul's answer is nice, but it fails for v=4294967295U (=2^32-1). In that case v2=2^31-1 and v1=2^31. Intrinsic _mm_cvtepi32_ps converts 2^31 to -2.14748365E9 . v2=2^31-1 is converted to 2.14748365E9 and consequently _mm_add_ps returns 0 (due to rounding v1f and v2f are the exact opposite of each other).
The idea of the solution below is to copy the most significant bit of v to v_high. The other bits of v are copied to v_low. v_high is converted to 0 or 2.14748365E9 .
inline __m128 _mm_cvtepu32_v3_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i zero=_mm_xor_si128(msk0,msk0);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000); /* IEEE representation of float 2^31 */
__m128i v_high=_mm_andnot_si128(msk0,v);
__m128i v_low=_mm_and_si128(msk0,v);
__m128 v_lowf=_mm_cvtepi32_ps(v_low);
__m128i msk1=_mm_cmpeq_epi32(v_high,zero);
__m128 v_highf=_mm_castsi128_ps(_mm_andnot_si128(msk1,cnst2_31));
__m128 v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;
}
Update
It was possible to reduce the number of instructions:
inline __m128 _mm_cvtepu32_v4_ps(const __m128i v)
{
__m128i msk0=_mm_set1_epi32(0x7FFFFFFF);
__m128i cnst2_31=_mm_set1_epi32(0x4F000000);
__m128i msk1=_mm_srai_epi32(v,31);
__m128i v_low=_mm_and_si128(msk0,v);
__m128 v_lowf=_mm_cvtepi32_ps(v_low);
__m128 v_highf=_mm_castsi128_ps(_mm_and_si128(msk1,cnst2_31));
__m128 v_sum=_mm_add_ps(v_lowf,v_highf);
return v_sum;
}
Intrinsic _mm_srai_epi32 shifts the most significant bit of v to the right, while shifting in sign bits, which turns out to be quite useful here.
Related
I wonder if there is a fast way of multiplying int8 arrays, i.e.
for(i = 0; i < n; ++i)
z[i] = x * y[i];
I see that the Intel intrinsics guide lists several SIMD instructions, such as _mm_mulhi_epi16 and _mm_mullo_epi16 that do something like this for int16. Is there something similar for int8 that I'm missing?
Breaking the input into low & hi, one can
__m128i const kff00ff00 = _mm_set1_epi32(0xff00ff00);
__m128i lo = _mm_mullo_epi16(y, x);
__m128i hi = _mm_mullo_epi16(_mm_and_si128(y, kff00ff00), x);
__m128i z = _mm_blendv_epi8(lo, hi, kff00ff00);
AFAIK, the high bits YY of the YYyy|YYyy|YYyy|YYyy multiplied by 00xx|00xx|00xx|00xx do not interfere with the low 8 bits ??ll, and likewise the product of YY00|YY00 * 00xx|00xx produces the correct 8 bit product at HH00. These two results at the correct alignment need to be blended.
__m128i x = _mm_set1_epi16(scalar_x);, and __m128i y = _mm_loadu_si128(...);
An alternative is to use shufb calculating LutLo[y & 15] + LutHi[y >> 4], where unfortunately the shift must be also emulated by _mm_and_si128(_mm_srli_epi16(y,4),_mm_set1_epi8(15)).
I'm wondering how load and store efficiently vars when working with SSE2.
In this example, I want to bench the pclmulqdq instruction (carry less multiplication, useful for polynomial arithmetic) vs plain C function, so I need the same "calling convention" that a standard function.
a and b are 16 significant bits, result will have 32 significant bits
#include <wmmintrin.h>
int GFpoly_mul_i(int a, int b) {
__m128i xa = _mm_loadu_si128( (__m128i*) a);
__m128i xb = _mm_loadu_si128((__m128i*) b);
__m128i r = _mm_clmulepi64_si128(xa, xb, 0);
_MM_ALIGN16 int result[4];
__m128i* ptr_result = (__m128i*)result;
_mm_store_si128(ptr_result, r);
return result[0];
}
Extracting the 32bit integer from the lowest part of a vector can be done easily with _mm_cvtsi128_si32:
return _mm_cvtsi128_si32(r);
Loading a 32bit integer into the lowest part of a vector can be done with the "opposite" operation, _mm_cvtsi32_si128:
__m128i xa = _mm_cvtsi32_si128(a);
Loading the integer a into a vector cannot be done with _mm_loadu_si128( (__m128i*) a), this would cast a to a pointer and dereference it (reading a 128bit vector), but a is just an integer value and doesn't point anywhere useful, except perhaps by accident.
I'm totally new to SSE programming, but have an Intel Core i7 processor.
Basically, I want to take 4 32-bit unsigned integers and cube them all (raise to the power of 3) at once. It is my understanding that the SIMD functionality of SSE and its successors make this possible, but how in the world do I go about doing it? Preferably in C but I could manage assembly if necessary.
Edit to make clear my final goal:
Then, I want to add all the cubes together to come up with a single number.
Background: I'm just trying to use SSE to optimize figuring out if a number is an Armstrong number (a three-digit number whose sum of each digit cubed is the same as the number itself). An example is 153. There seems to be no way to do this other than brute force. These are a subset of Narcissistic numbers whose sum of all digits to the power of the length of the decimal number are equal to number itself. Hopefully, I'd like to eventually expand it to be more flexible, to start I'm just doing the Armstrong numbers. As you might imagine, this came up on another site and a few of us are trying to optimize the hell out of it. By taking your ideas and my own research, I came up with this code:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1
__m128i vcube(const __m128i v)
{
return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
int main(int argc, const char * argv[]) {
for (unsigned int i = 1; i <= 500; i++) {
unsigned int firstDigit = i / 100;
unsigned int secondDigit = (i - firstDigit * 100) / 10;
unsigned int thirdDigit = (i - firstDigit * 100 - secondDigit * 10);
__m128i v = _mm_setr_epi32(0, firstDigit, secondDigit, thirdDigit);
__m128 v3 = (__m128) vcube(v);
v3 = _mm_hadd_ps(v3, v3);
v3 = _mm_hadd_ps(v3, v3);
if (_mm_extract_epi32((__m128i) v3, 0) == i)
printf ("%03d is an Armstrong number\n", i);
}
return 0;
}
Note: I had to do some type coercions to get it to compile in some systems (Solaris, at least some Linux).
So this works, but maybe it could be streamlined. Sorry I didn't post the whole task, but I was trying to break it down into steps and I wanted to make sure each digit was correctly cubed.
(END EDIT)
Thank you!
Edit: I guess I should add I'm running Mac OS X Sierra.
EDIT AGAIN:
So, let's say I make these all these unsigned shorts instead of unsigned ints and add more digits, how do I add them together when a short may not be able to hold the sum of all the digits? Is there a way to add them and store in a vector of larger variables if you know what I mean, or a plain larger number such as a UInt64?
Sorry for all the questions, but like I said I'm totally new at vector processing even though I had access to it since my first Mac G4.
If your input values are in the range 0..1625 (so that the result fits in 32 bits) then you can use _mm_mullo_epi32:
__m128i vcube(const __m128i v)
{
return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
Demo:
#include <stdio.h>
#include <smmintrin.h> // SSE 4.1
__m128i vcube(const __m128i v)
{
return _mm_mullo_epi32(v, _mm_mullo_epi32(v, v));
}
int main()
{
__m128i v = _mm_setr_epi32(0, 1, 1000, 1625);
__m128i v3 = vcube(v);
printf("%vlu => %vlu\n", v, v3);
return 0;
}
Compile and test:
$ gcc -Wall -Wno-format-invalid-specifier -Wno-format-extra-args -msse4 vcube.c && ./a.out
0 1 1000 1625 => 0 1 1000000000 4291015625
For x<=2642245 you can do x*x*x using the foo_SSE function below using SSE4.1. This takes two 32-bit unsigned intergs as input packed into the upper and lower 64-bits of a SSE register and outputs two 64-bit integers.
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
__m128i foo_SSE(__m128i x) {
__m128i mask = _mm_set_epi32(-1, 0, -1, 0);
__m128i x2 =_mm_shuffle_epi32(x, 0x80);
__m128i t0 = _mm_mul_epu32(x,x);
__m128i t1 = _mm_mul_epu32(t0,x);
__m128i t2 = _mm_mullo_epi32(t0,x2);
__m128i t3 = _mm_and_si128(t2, mask);
__m128i t4 = _mm_add_epi32(t3, t1);
return t4;
}
int main(void) {
uint64_t k1 = 100000;
uint64_t k2 = 2642245;
__m128i x = _mm_setr_epi32(k1, 0, k2, 0);
uint64_t t[2];
_mm_store_si128((__m128i*)t, foo_SSE(x));
printf("%20" PRIu64 " ", t[0]);
printf("%20" PRIu64 "\n", t[1]);
printf("%20" PRIu64 " ", k1*k1*k1);
printf("%20" PRIu64 "\n", k2*k2*k2);
}
This can probably be improved a bit. I'm a little out of practice.
To get a quick overview about the 3 main stages (loading, operating, storing) see the following snippet. For integers e0 and e1:
#include "emmintrin.h"
__m128i result __attribute__((aligned(16)));
__m128i x = _mm_setr_epi32(0, e1, 0, e0);
__m128i cube = _mm_mul_epu32(x, _mm_mul_epu32(x, x));
_mm_store_si128(&result, cube);
The _mm_mul_epu32 takes the even multiples of 32bits of two _m128i registers, multiplies them and puts the result as 2-tuple of 64bits into the result register.
To get them out of there access either access them through a cast or use your compiler's convenience definition of __m128i, e.g. for icc:
printf("%llu %llu\n", result.m128i_i64[0], result.m128i_i64[1]); /* msc style */
Note: I'm using the Intel Intrinsics guide for SSE primitives.
Edited for clarity about what the code actually does.
I am currently looking into ways of using the fast single-precision floating-point reciprocal capability of various modern processors to compute a starting approximation for a 64-bit unsigned integer division based on fixed-point Newton-Raphson iterations. It requires computation of 264 / divisor, as accurately as possible, where the initial approximation must be smaller than, or equal to, the mathematical result, based on the requirements of the following fixed-point iterations. This means this computation needs to provide an underestimate. I currently have the following code, which works well, based on extensive testing:
#include <stdint.h> // import uint64_t
#include <math.h> // import nextafterf()
uint64_t divisor, recip;
float r, s, t;
t = uint64_to_float_ru (divisor); // ensure t >= divisor
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; // underestimate of 2**64 / divisor
While this code is functional, it isn't exactly fast on most platforms. One obvious improvement, which requires a bit of machine-specific code, is to replace the division r = 1.0f / t with code that makes use of a fast floating-point reciprocal provided by the hardware. This can be augmented with iteration to produce a result that is within 1 ulp of the mathematical result, so an underestimate is produced in the context of the existing code. A sample implementation for x86_64 would be:
#include <xmmintrin.h>
/* Compute 1.0f/a almost correctly rounded. Halley iteration with cubic convergence */
inline float fast_recip_f32 (float a)
{
__m128 t;
float e, r;
t = _mm_set_ss (a);
t = _mm_rcp_ss (t);
_mm_store_ss (&r, t);
e = fmaf (r, -a, 1.0f);
e = fmaf (e, e, e);
r = fmaf (e, r, r);
return r;
}
Implementations of nextafterf() are typically not performance optimized. On platforms where there are means to quickly re-interprete an IEEE 754 binary32 into an int32 and vice versa, via intrinsics float_as_int() and int_as_float(), we can combine use of nextafterf() and scaling as follows:
s = int_as_float (float_as_int (r) + 0x1fffffff);
Assuming these approaches are possible on a given platform, this leaves us with the conversions between float and uint64_t as major obstacles. Most platforms don't provide an instruction that performs a conversion from uint64_t to float with static rounding mode (here: towards positive infinity = up), and some don't offer any instructions to convert between uint64_t and floating-point types, making this a performance bottleneck.
t = uint64_to_float_ru (divisor);
r = fast_recip_f32 (t);
s = int_as_float (float_as_int (r) + 0x1fffffff);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
A portable, but slow, implementation of uint64_to_float_ru uses dynamic changes to FPU rounding mode:
#include <fenv.h>
#pragma STDC FENV_ACCESS ON
float uint64_to_float_ru (uint64_t a)
{
float res;
int curr_mode = fegetround ();
fesetround (FE_UPWARD);
res = (float)a;
fesetround (curr_mode);
return res;
}
I have looked into various splitting and bit-twiddling approaches to deal with the conversions (e.g. do the rounding on the integer side, then use a normal conversion to float which uses the IEEE 754 rounding mode round-to-nearest-or-even), but the overhead this creates makes this computation via fast floating-point reciprocal unappealing from a performance perspective. As it stands, it looks like I would be better off generating a starting approximation by using a classical LUT with interpolation, or a fixed-point polynomial approximation, and follow those up with a 32-bit fixed-point Newton-Raphson step.
Are there ways to improve the efficiency of my current approach? Portable and semi-portable ways involving intrinsics for specific platforms would be of interest (in particular for x86 and ARM as the currently dominant CPU architectures). Compiling for x86_64 using the Intel compiler at very high optimization (/O3 /QxCORE-AVX2 /Qprec-div-) the computation of the initial approximation takes more instructions than the iteration, which takes about 20 instructions. Below is the complete division code for reference, showing the approximation in context.
uint64_t udiv64 (uint64_t dividend, uint64_t divisor)
{
uint64_t temp, quot, rem, recip, neg_divisor = 0ULL - divisor;
float r, s, t;
/* compute initial approximation for reciprocal; must be underestimate! */
t = uint64_to_float_ru (divisor);
r = 1.0f / t;
s = 0x1.0p64f * nextafterf (r, 0.0f);
recip = (uint64_t)s; /* underestimate of 2**64 / divisor */
/* perform Halley iteration with cubic convergence to refine reciprocal */
temp = neg_divisor * recip;
temp = umul64hi (temp, temp) + temp;
recip = umul64hi (recip, temp) + recip;
/* compute preliminary quotient and remainder */
quot = umul64hi (dividend, recip);
rem = dividend - divisor * quot;
/* adjust quotient if too small; quotient off by 2 at most */
if (rem >= divisor) quot += ((rem - divisor) >= divisor) ? 2 : 1;
/* handle division by zero */
if (divisor == 0ULL) quot = ~0ULL;
return quot;
}
umul64hi() would generally map to a platform-specific intrinsic, or a bit of inline assembly code. On x86_64 I currently use this implementation:
inline uint64_t umul64hi (uint64_t a, uint64_t b)
{
uint64_t res;
__asm__ (
"movq %1, %%rax;\n\t" // rax = a
"mulq %2;\n\t" // rdx:rax = a * b
"movq %%rdx, %0;\n\t" // res = (a * b)<63:32>
: "=rm" (res)
: "rm"(a), "rm"(b)
: "%rax", "%rdx");
return res;
}
This solution combines two ideas:
You can convert to floating point by simply reinterpreting the bits as floating point and subtracting a constant, so long as the number is within a particular range. So add a constant, reinterpret, and then subtract that constant. This will give a truncated result (which is therefore always less than or equal the desired value).
You can approximate reciprocal by negating both the exponent and the mantissa. This may be achieved by interpreting the bits as int.
Option 1 here only works in a certain range, so we check the range and adjust the constants used. This works in 64 bits because the desired float only has 23 bits of precision.
The result in this code will be double, but converting to float is trivial, and can be done on the bits or directly, depending on hardware.
After this you'd want to do the Newton-Raphson iteration(s).
Much of this code simply converts to magic numbers.
double
u64tod_inv( uint64_t u64 ) {
__asm__( "#annot0" );
union {
double f;
struct {
unsigned long m:52; // careful here with endianess
unsigned long x:11;
unsigned long s:1;
} u64;
uint64_t u64i;
} z,
magic0 = { .u64 = { 0, (1<<10)-1 + 52, 0 } },
magic1 = { .u64 = { 0, (1<<10)-1 + (52+12), 0 } },
magic2 = { .u64 = { 0, 2046, 0 } };
__asm__( "#annot1" );
if( u64 < (1UL << 52UL ) ) {
z.u64i = u64 + magic0.u64i;
z.f -= magic0.f;
} else {
z.u64i = ( u64 >> 12 ) + magic1.u64i;
z.f -= magic1.f;
}
__asm__( "#annot2" );
z.u64i = magic2.u64i - z.u64i;
return z.f;
}
Compiling this on an Intel core 7 gives a number of instructions (and a branch), but, of course, no multiplies or divides at all. If the casts between int and double are fast this should run pretty quickly.
I suspect float (with only 23 bits of precision) will require more than 1 or 2 Newton-Raphson iterations to get the accuracy you want, but I haven't done the math...
I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken entirely by a simple 64-bit integer division! And sadly none of the terms of that division are constant. Can I speed up the division itself? Is there any way I can remove it?
I think I need this division because since I approximate atan2(y, x) with a 1D lookup table I need to normalise the distance of the point represented by x,y to something like a unit circle or unit square (I chose a unit 'diamond' which is a unit square rotated by 45°, which gives a pretty even precision across the positive quadrant). So the division finds (|y|-|x|) / (|y|+|x|). Note that the divisor is in 32-bits while the numerator is a 32-bit number shifted 29 bits right so that the result of the division has 29 fractional bits. Also using floating point division is not an option as this function is required not to use floating point arithmetic.
Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:
int32_t fpatan2(int32_t y, int32_t x) // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
#include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
const uint32_t outfmt = 32; // final output format in s0.outfmt
const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp); // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
const uint32_t halfof = 1UL<<(outfmt-1); // represents 0.5 in the output format, which since it is in turns means half a circle
const uint32_t pds=ds-lutsp; // division shift and post-division shift
uint32_t lutind, p, t, d;
int32_t a, b, xa, ya, xs, ys, div, r;
xs = x >> 31; // equivalent of fabs()
xa = (x^xs) - xs;
ys = y >> 31;
ya = (y^ys) - ys;
d = ya+xa;
if (d==0) // if both y and x are 0 then they add up to 0 and we must return 0
return 0;
// the following does 0.5 * (1. - (y-x) / (y+x))
// (y+x) is u1.31, (y-x) is s0.31, div is in s1.29
div = ((int64_t) (ya-xa)<<ds) / d; // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
p = ((1UL<<ds) - div) >> 1; // before shift the format is s2.29. position in u1.29
lutind = p >> ish; // index for the LUT
t = (p & tmask) >> tbd; // interpolator between two LUT entries
a = fpatan_lut[lutind];
b = fpatan_lut[lutind+1];
r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip); // linear interpolation of a and b by t in s0.32 format
// Quadrants
if (xs) // if x was negative
r = halfof - r; // r = 0.5 - r
r = (r^ys) - ys; // if y was negative then r is negated
return r;
}
Unfortunately a 70 cycles latency is typical for a 64-bit integer division on x86 CPUs. Floating point division typically has about half the latency or less. The increased cost comes from the fact modern CPUs only have dividers in their floating point execution units (they're very expensive in terms silicon area), so need to convert the integers to floating point and back again. So just substituting a floating division in place of the integer one isn't likely to help. You'll need to refactor your code to use floating point instead to take advantage of faster floating point division.
If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.
Based on #Iwillnotexist Idonotexist's suggestion to use lzcnt, reciprocity and multiplication I implemented a division function that runs in about 23.3 cycles and with a pretty great precision of 1 part in 19 million with a 1.5 kB LUT, e.g. one of the worst cases being for 1428769848 / 1080138864 you might get 1.3227648959 instead of 1.3227649663.
I figured out an interesting technique while researching this, I was really struggling to think of something that could be fast and precise enough, as not even a quadratic approximation of 1/x in [0.5 , 1.0) combined with an interpolated difference LUT would do, then I had the idea of doing it the other way around so I made a lookup table that contains the quadratic coefficients that fit the curve on a short segment that represents 1/128th of the [0.5 , 1.0) curve, which gives you a very small error like so. And using the 7 most significant bits of what represents x in the [0.5 , 1.0) range as a LUT index I directly get the coefficients that work best for the segment that x falls into.
Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:
#include "ffo_lut.h"
static INLINE int32_t log2_ffo32(uint32_t x) // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
int32_t y;
y = x>>21; if (y) return ffo_lut[y]+21;
y = x>>10; if (y) return ffo_lut[y]+10;
return ffo_lut[x];
}
// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0. Division by 0 causes a crash by negative index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt) // ~23.3 cycles, max error (by division) 53.39e-9
{
#include "fpdiv.h" // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
const int32_t *c;
int32_t xa, xs, p, sh;
uint32_t expon, frx, lutind;
const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1); // the shift for the index, the shift for 31-bit xa, the value of 0.5
int64_t out;
int64_t c0, c1, c2;
// turn x into xa (|x|) and sign of x (xs)
xs = x >> 31;
xa = (x^xs) - xs;
// decompose |x| into frx * 2^expon
expon = log2_ffo32(xa);
frx = (xa << (31-expon)) >> cfs; // the fractional part is now in 0.27 format
// lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
lutind = (frx - half) >> ish; // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
lutind *= 3; // 3 entries for each index
c = &fpdiv_lut[lutind]; // c points to the correct c0, c1, c2
c0 = c[0]; c1 = c[1]; c2 = c[2];
p = (int64_t) frx * frx >> prec; // x^2
p = c2 * p >> prec; // c2 * x^2
p += c1 * frx >> prec; // + c1 * x
p += c0; // + c0, p = (1.0 , 2.0] in 2.27 format
// apply the necessary bit shifts and reapplies the original sign of x to make final result
sh = expon + prec - outfmt; // calculates the final needed shift
out = (int64_t) y * p; // format is s31 + 1.27 = s32.27
if (sh >= 0)
out >>= sh;
else
out <<= -sh;
out = (out^xs) - xs; // if x was negative then out is negated
return out;
}
I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.
As for the fpatan2() question the solution would be to replace this line:
div = ((int64_t) (ya-xa)<<ds) / d;
with that line:
div = fpdiv(ya-xa, d, ds);
Yours time hog instruction:
div = ((int64_t) (ya-xa)<<ds) / d;
exposes at least two issues. The first one is that you mask the builtin div function; but this is minor fact, could be never observed. The second one is that first, according to C language rules, both operands are converted to common type which is int64_t, and, then, division for this type is expanded into CPU instruction which divides 128-bit dividend by 64-bit divisor(!) Extract from assembly of cut-down version of your function:
21: 48 89 c2 mov %rax,%rdx
24: 48 c1 fa 3f sar $0x3f,%rdx ## this is sign bit extension
28: 48 f7 fe idiv %rsi
Yep, this division requires about 70 cycles and can't be optimized (well, really it can, but e.g. reverse divisor approach requires multiplication with 192-bit product). But if you are sure this division can be done with 64-bit dividend and 32-bit divisor and it won't overflow (quotient will fit into 32 bits) (I agree because ya-xa is always less by absolute value than ya+xa), this can be sped up using explicit assembly request:
uint64_t tmp_num = ((int64_t) (ya-xa))<<ds;
asm("idivl %[d]" :
[a] "=a" (div1) :
"[a]" (tmp_num), "d" (tmp_num >> 32), [d] "q" (d) :
"cc");
this is quick&dirty and shall be carefully verified, but I hope the idea is understood. The resulting assembly now looks like:
18: 48 98 cltq
1a: 48 c1 e0 1d shl $0x1d,%rax
1e: 48 89 c2 mov %rax,%rdx
21: 48 c1 ea 20 shr $0x20,%rdx
27: f7 f9 idiv %ecx
This seems to be huge advance because 64/32 division requires up to 25 clock cycles on Core family, according to Intel optimization manual, instead of 70 you see for 128/64 division.
More minor approvements can be added; e.g. shifts can be done yet more economically in parallel:
uint32_t diff = ya - xa;
uint32_t lowpart = diff << 29;
uint32_t highpart = diff >> 3;
asm("idivl %[d]" :
[a] "=a" (div1) :
"[a]" (lowpart), "d" (highpart), [d] "q" (d) :
"cc");
which results in:
18: 89 d0 mov %edx,%eax
1a: c1 e0 1d shl $0x1d,%eax
1d: c1 ea 03 shr $0x3,%edx
22: f7 f9 idiv %ecx
but this is minor fix, compared to the division-related one.
To conclude, I really doubt this routine is worth to be implemented in C language. The latter is quite ineconomical in integer arithmetic, requiring useless expansions and high part losses. The whole routine is worth to be moved to assembler.
Given an fpatan() implementation, you could simply implement fpatan2() in terms of that.
Assuming constants defined for pi abd pi/2:
int32_t fpatan2( int32_t y, int32_t x)
{
fixed theta ;
if( x == 0 )
{
theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
}
else
{
theta = fpatan( y / x ) ;
if( x < 0 )
{
theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
}
}
return theta ;
}
Note that fixed library implementations are easy to get very wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library under discussion makes the code much simpler, in most cases you can just replace the float or double keyword with fixed. It does not however have an atan2() implementation, the code above is adapted from my implementation for that library.