Fastest Integer Square Root in the least amount of instructions

I need a fast integer square root that does not involve any explicit division. The target RISC architecture can do operations like add, mul, sub, shift in one cycle (well, the operation's result is written in the third cycle, really, but there's interleaving), so any integer algorithm that uses these ops and is fast would be very much appreciated.
This is what I have right now and I'm thinking that a binary search should be faster, since the following loop executes 16 times every single time (regardless of the value). I haven't debugged it extensively yet (but soon), so perhaps it's possible to have an early exit there:
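unsigned short int int_sqrt32(unsigned int x)
{
    unsigned short int res = 0;
    unsigned short int add = 0x8000;
    int i;
    for (i = 0; i < 16; i++)
    {
        unsigned short int temp = res | add;          /* try setting the next lower bit */
        unsigned int g2 = (unsigned int)temp * temp;  /* cast avoids signed overflow after promotion to int */
        if (x >= g2)
        {
            res = temp;                               /* the bit fits, keep it */
        }
        add >>= 1;
    }
    return res;
}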
Looks like the current performance cost of the above [in the context of the target RISC] is a loop of 5 instructions (bitset, mul, compare, store, shift). There is probably no space in cache to unroll it fully (but this will be the prime candidate for a partial unroll [e.g. a loop of 4 rather than 16], for sure). So, the cost is 16*5 = 80 instructions (plus loop overhead, if not unrolled), which, if fully interleaved, would cost only 80 (+2 for the last instruction) cycles.
Can I get some other sqrt implementation (using only add, mul, bitshift, store/cmp) under 82 cycles?
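One possible form of the early exit mentioned above, as a minimal sketch rather than a tuned solution: if x is below 0x10000, its square root is below 0x100, so the upper eight trial bits can be skipped. The function name int_sqrt32_skip and the single threshold are assumptions made for illustration; a real port could test more thresholds, at the price of more compares.

```c
#include <stdint.h>

/* Same bit-by-bit search as above, but starting at a lower trial bit
   when x is small. Hypothetical name, illustration only. */
uint16_t int_sqrt32_skip(uint32_t x)
{
    uint16_t res = 0;
    uint16_t add = 0x8000;

    if (x < 0x10000u)
        add = 0x80;                 /* sqrt(x) < 256: skip the top 8 trial bits */

    while (add != 0) {
        uint16_t temp = res | add;
        uint32_t g2 = (uint32_t)temp * temp;
        if (x >= g2)
            res = temp;
        add >>= 1;
    }
    return res;
}
```

For inputs below 65536 this runs 8 iterations instead of 16; larger inputs pay one extra compare and still run all 16.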
FAQ:
Why don't you rely on the compiler to produce a good fast code?
There is no working C → RISC compiler for the platform. I will be porting the current reference C code into hand-written RISC ASM.
Did you profile the code to see if sqrt is actually a bottleneck?
No, there is no need for that. The target RISC chip runs at about twenty MHz, so every single instruction counts. The core loop (calculating the energy transfer form factor between the shooter and receiver patch), where this sqrt is used, will be run ~1,000 times each rendering frame (assuming it will be fast enough, of course), up to 60,000 times per second, and roughly 1,000,000 times for the whole demo.
Have you tried to optimize the algorithm to perhaps remove the sqrt?
Yes, I did that already. In fact, I got rid of 2 sqrts already and lots of divisions (removed or replaced by shifting). I can see a huge performance boost (compared to the reference float version) even on my gigahertz notebook.
What is the application?
It's a real-time progressive-refinement radiosity renderer for the compo demo. The idea is to have one shooting cycle each frame, so it would visibly converge and look better with each rendered frame (e.g. up to 60 times per second, though the SW rasterizer probably won't be that fast [but at least it can run on the other chip in parallel with the RISC, so if it takes 2-3 frames to render the scene, the RISC will have worked through 2-3 frames of radiosity data in parallel]).
Why don't you work directly in target ASM?
Because radiosity is a slightly involved algorithm and I need the instant edit & continue debugging capability of Visual Studio. What I've done over the weekend in VS (a couple hundred code changes to convert the floating-point math to integer-only) would take me 6 months on the target platform with only print debugging.
Why can't you use a division?
Because it's 16-times slower on the target RISC than any of the following: mul, add, sub, shift, compare, load/store (which take just 1 cycle). So, it's used only when absolutely required (a couple times already, unfortunately, when shifting could not be used).
Can you use look-up tables?
The engine needs other LUTs already and copying from main RAM to RISC's little cache is prohibitively expensive (and definitely not each and every frame). But, I could perhaps spare 128-256 Bytes if it gave me at least a 100-200% boost for sqrt.
What's the range of the values for sqrt?
I managed to reduce it to a mere unsigned 32-bit int (maximum 4,294,967,295).
EDIT1: I have ported two versions into the target RISC ASM, so I now have an exact count of ASM instructions during the execution (for the test scene).
Number of sqrt calls: 2,800.
Method1: The same method in this post (loop executing 16 times)
Method2: fred_sqrt (3c from http://www.azillionmonkeys.com/qed/sqroot.html)
Method1: 152.98 instructions per sqrt
Method2: 39.48 instructions per sqrt (with Final Rounding and 2 Newton iterations)
Method2: 21.01 instructions per sqrt (without Final Rounding and 2 Newton iterations)
Method2 uses LUT with 256 values, but since the target RISC can only use 32-bit access within its 4 KB cache, it actually takes 256*4 = 1 KB. But given its performance, I guess I will have to spare that 1 KB (out of 4).
Also, I have found out that there is NO visible visual artifact when I disable the Final rounding and two Newton iterations at the end (of Method2).
Meaning, the precision of that LUT is apparently good enough. Who knew...
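For readers unfamiliar with table-driven square roots, here is a minimal sketch of the general idea. It is not the fred_sqrt code from the linked page: the table layout, the normalisation loop and all names (sqrt_lut, sqrt_lut_init, sqrt_lut_approx, isqrt_bitwise) are assumptions chosen for illustration. The input is reduced by whole bit pairs until it fits a 256-entry table, and the looked-up root is then scaled back by half as many bits.

```c
#include <stdint.h>

/* sqrt_lut[i] holds floor(16 * sqrt(i)), i.e. floor(sqrt(256 * i)). */
static uint8_t sqrt_lut[256];

/* Exact bit-by-bit square root, used here only to fill the table. */
static uint32_t isqrt_bitwise(uint32_t x)
{
    uint32_t res = 0;
    for (uint32_t add = 0x8000; add != 0; add >>= 1) {
        uint32_t temp = res | add;
        if (x >= temp * temp)
            res = temp;
    }
    return res;
}

/* Must be called once before sqrt_lut_approx(). */
void sqrt_lut_init(void)
{
    for (uint32_t i = 0; i < 256; i++)
        sqrt_lut[i] = (uint8_t)isqrt_bitwise(i << 8);
}

/* Approximate floor(sqrt(x)); the result never exceeds the exact value. */
uint32_t sqrt_lut_approx(uint32_t x)
{
    uint32_t half_shift = 0;
    while (x > 255) {              /* drop bit pairs until x indexes the table */
        x >>= 2;
        half_shift++;
    }
    uint32_t r = sqrt_lut[x];      /* about 16 * sqrt(reduced x) */
    return (half_shift >= 4) ? r << (half_shift - 4)
                             : r >> (4 - half_shift);
}
```

In hand-written assembly the normalisation loop would more likely become a chain of range compares, and the table would be precomputed rather than filled at run time. The accuracy is limited by the 8-bit table entries, which is the kind of error that final rounding or Newton iterations would otherwise clean up.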
The final cost is then 21.01 instructions per sqrt, which is almost an order of magnitude faster than the very first solution. It may be possible to reduce it further by sacrificing a few of the 32 available registers for the constants used in the conditions and jump labels. Each condition must fill 2 registers: one with the actual constant we are comparing against (only values less than 16 are allowed within the CMPQ instruction; larger ones must be put into a register) and a second for the jump to the else label (the then jump is fall-through), as a direct relative jump is only possible within ~10 instructions (impossible with such a large if-then-else chain, other than the innermost 2 conditions).
EDIT2: ASM micro-optimizations
While benchmarking, I added counters for each of the 26 If-Then-Else code blocks, to see whether some blocks are executed far more often than others.
Turns out that Blocks 0/10/11 are executed in 99.57%/99.57%/92.57% of cases. This means I can justify sacrificing 3 registers (out of 32) for those comparison constants (in those 3 blocks), e.g. r26 = $1.0000 r25 = $100.0000 r24 = $10.0000
This brought down the total instruction cost from 58,812 (avg:21.01) to 50,448 (avg:18.01)
So, now the average sqrt cost is just 18.01 ASM instructions (and there is no division!), though it will have to be inlined.
EDIT3: ASM micro-optimizations
Since we know that those 3 blocks (0/10/11) are executed most often, we can use local short jumps in those particular conditions. (Short jumps reach 16 bytes in both directions, which is usually just a couple of instructions and hence mostly useless, especially when the 6-byte MOVEI #jump_label, register is used during conditions.) Of course, the Else condition will then incur 2 additional ops (that it would not have otherwise), but that's worth it. Block 10 will have to be swapped (Then block with Else block), which will make it harder to read and debug, but I documented the reasons profusely.
Now the total instruction cost (in test scene) is just 42,500 with an average of 15.18 ASM instructions per sqrt.
EDIT4: ASM micro-optimizations
Block 11 condition splits into innermost Blocks 12&13. It just so happens that those blocks don't need additional +1 math op, hence the local short jump can actually reach the Else block, if I merge bitshift right with the necessary bitshift left #2 (as all offsets within cache must be 32-bit). This saves on filling the jump register though I do need to sacrifice one more register r23 for the comparison value of $40.000.
The final cost is then 34,724 instructions with an average of 12.40 ASM instructions per sqrt.
I am also realizing that I could reshuffle the order of conditions (which will make the other range a few ops more expensive, but that's happening for ~7% of cases only), favoring this particular range ($10.000, $40.000) first, saving at least 1 or maybe even 2 conditions.
In which case, it should fall down to ~8.40 per sqrt.
I am realizing that the range depends directly on the intensity of the light and the distance to the wall. Meaning, I have direct control over the RGB value of the light and the distance from the wall. And while I would like the solution to be as generic as possible, given this realization (~12 ops per sqrt is mind-blowing), I will gladly sacrifice some flexibility in light colors if I can get sqrt this fast. Besides, there are maybe 10-15 different lights in the whole demo, so I can simply find color combinations that eventually result in the same sqrt range, but will get an insanely fast sqrt. For sure, that's worth it too. And I still have a generic fallback (spanning the entire int range) working just fine. Best of both worlds, really.

Have a look here.
For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:
/* by Jim Ulery */
static unsigned julery_isqrt(unsigned long val) {
    unsigned long temp, g = 0, b = 0x8000, bshft = 15;
    do {
        if (val >= (temp = (((g << 1) + b) << bshft--))) {
            g += b;
            val -= temp;
        }
    } while (b >>= 1);
    return g;
}
No divisions, no multiplications, bit shifts only. However, the time taken will be somewhat unpredictable particularly if you use a branch (on ARM RISC conditional instructions would work).
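If the branch is a concern and the target has no conditional instructions, the same loop can be written with a mask instead of an if. The following is a minimal sketch under that assumption; the name julery_isqrt_branchless is hypothetical, and whether the comparison itself compiles without a branch depends on the compiler and the instruction set.

```c
#include <stdint.h>

/* Bit-by-bit square root as above, with the conditional update
   replaced by masking. Illustration only. */
uint32_t julery_isqrt_branchless(uint32_t val)
{
    uint32_t g = 0;
    uint32_t b = 0x8000;

    for (int bshft = 15; bshft >= 0; bshft--) {
        uint32_t temp = ((g << 1) + b) << bshft;
        /* all ones when the trial bit fits, all zeros otherwise */
        uint32_t mask = 0u - (uint32_t)(val >= temp);
        g   += b & mask;
        val -= temp & mask;
        b >>= 1;
    }
    return g;
}
```

This always executes the same number of operations per bit, which makes the timing predictable at the cost of two extra AND operations per iteration.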
In general, this page lists ways to calculate square roots. If you happen to want to produce a fast inverse square root (i.e. x**(-0.5) ), or are just interested in amazing ways to optimise code, take a look at this, this and this.

This is the same as yours, but with fewer ops. (I count 9 ops in the loop in your code, including test and increment i in the for loop and 3 assignments, but perhaps some of those disappear when coded in ASM? There are 6 ops in the code below, if you count g*g>n as two (no assignment)).
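/* assumes unsigned is 32 bits; unsigned arithmetic keeps g*g from overflowing */
unsigned isqrt(unsigned n) {
    unsigned g = 0x8000;
    unsigned c = 0x8000;
    for (;;) {
        if (g*g > n) {
            g ^= c;
        }
        c >>= 1;
        if (c == 0) {
            return g;
        }
        g |= c;
    }
}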
I got it here. You can maybe eliminate a comparison if you unroll the loop and jump to the appropriate spot based on the highest non-zero bit in the input.
Update
I've been thinking more about using Newton's method. In theory, the number of bits of accuracy should double for each iteration. That means Newton's method is much worse than any of the other suggestions when there are few correct bits in the answer; however, the situation changes where there are a lot of correct bits in the answer. Considering that most suggestions seem to take 4 cycles per bit, that means that one iteration of Newton's method (16 cycles for division + 1 for addition + 1 for shift = 18 cycles) is not worthwhile unless it gives over 4 bits.
So, my suggestion is to build up 8 bits of the answer by one of the suggested methods (8*4 = 32 cycles) then perform one iteration of Newton's method (18 cycles) to double the number of bits to 16. That's a total of 50 cycles (plus maybe an extra 4 cycles to get 9 bits before applying Newton's method, plus maybe 2 cycles to overcome the overshoot occasionally experienced by Newton's method). That's a maximum of 56 cycles which as far as I can see rivals any of the other suggestions.
Second Update
I coded the hybrid algorithm idea. Newton's method itself has no overhead; you just apply and double the number of significant digits. The issue is to have a predictable number of significant digits before you apply Newton's method. For that, we need to figure out where the most significant bit of the answer will appear. Using a modification of the fast DeBruijn sequence method given by another poster, we can perform that calculation in about 12 cycles in my estimation. On the other hand, knowing the position of the msb of the answer speeds up all methods (average, not worst case), so it seems worthwhile no matter what.
After calculating the msb of the answer, I run a number of rounds of the algorithm suggested above, then finish it off with one or two rounds of Newton's method. We decide when to run Newton's method by the following calculation: one bit of the answer takes about 8 cycles according to calculation in the comments; one round of Newton's method takes about 18 cycles (division, addition, and shift, and maybe assignment), so we should only run Newton's method if we're going to get at least three bits out of it. So for 6 bit answers, we can run the linear method 3 times to get 3 significant bits, then run Newton's method 1 time to get another 3. For 15 bit answers, we run the linear method 4 times to get 4 bits, then Newton's method twice to get another 4 then another 7. And so on.
Those calculations depend on knowing exactly how many cycles are required to get a bit by the linear method vs. how many are required by Newton's method. If the "economics" change, e.g., by discovering a faster way to build up bits in a linear fashion, the decision of when to invoke Newton's method will change.
I unrolled the loops and implemented the decisions as switches, which I hope will translate into fast table lookups in assembly. I'm not absolutely sure that I've got the minimum number of cycles in each case, so maybe further tuning is possible. E.g., for s=10, you can try to get 5 bits then apply Newton's method once instead of twice.
I've tested the algorithm thoroughly for correctness. Some additional minor speedups are possible if you're willing to accept slightly incorrect answers in some cases. At least two cycles are used after applying Newton's method to correct an off-by-one error that occurs with numbers of the form m^2-1. And a cycle is used testing for input 0 at the beginning, as the algorithm can't handle that input. If you know you're never going to take the square root of zero you can eliminate that test. Finally, if you only need 8 significant bits in the answer, you can drop one of the Newton's method calculations.
#include <inttypes.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
uint32_t isqrt1(uint32_t n);
int main() {
uint32_t n;
bool it_works = true;
for (n = 0; n < UINT32_MAX; ++n) {
uint32_t sr = isqrt1(n);
if ( sr*sr > n || ( sr < 65535 && (sr+1)*(sr+1) <= n )) {
it_works = false;
printf("isqrt(%" PRIu32 ") = %" PRIu32 "\n", n, sr);
}
}
if (it_works) {
printf("it works\n");
}
return 0;
}
/* table modified to return shift s to move 1 to msb of square root of x */
/*
static const uint8_t debruijn32[32] = {
0, 31, 9, 30, 3, 8, 13, 29, 2, 5, 7, 21, 12, 24, 28, 19,
1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18
};
*/
static const uint8_t debruijn32[32] = {
15, 0, 11, 0, 14, 11, 9, 1, 14, 13, 12, 5, 9, 3, 1, 6,
15, 10, 13, 8, 12, 4, 3, 5, 10, 8, 4, 2, 7, 2, 7, 6
};
/* based on CLZ emulation for non-zero arguments, from
* http://stackoverflow.com/questions/23856596/counting-leading-zeros-in-a-32-bit-unsigned-integer-with-best-algorithm-in-c-pro
*/
uint8_t shift_for_msb_of_sqrt(uint32_t x) {
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x++;
return debruijn32 [x * 0x076be629 >> 27];
}
uint32_t isqrt1(uint32_t n) {
if (n==0) return 0;
uint32_t s = shift_for_msb_of_sqrt(n);
uint32_t c = 1 << s;
uint32_t g = c;
switch (s) {
case 9:
case 5:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 15:
case 14:
case 13:
case 8:
case 7:
case 4:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 12:
case 11:
case 10:
case 6:
case 3:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 2:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 1:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 0:
if (g*g > n) {
g ^= c;
}
}
/* now apply one or two rounds of Newton's method */
switch (s) {
case 15:
case 14:
case 13:
case 12:
case 11:
case 10:
g = (g + n/g) >> 1;
case 9:
case 8:
case 7:
case 6:
g = (g + n/g) >> 1;
}
/* correct potential error at m^2-1 for Newton's method */
return (g==65536 || g*g>n) ? g-1 : g;
}
In light testing on my machine (which admittedly is nothing like yours), the new isqrt1 routine runs about 40% faster on average than the previous isqrt routine I gave.
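The off-by-one correction on the last line of isqrt1 can be seen in isolation with a small example. This is a minimal sketch, not part of the routine above: the helper names newton_step and newton_step_corrected are made up for illustration, and 64-bit intermediates are used only to keep the host-side demonstration free of overflow.

```c
#include <stdint.h>

/* One integer Newton step for sqrt(n), starting from a non-zero guess g. */
uint32_t newton_step(uint32_t n, uint32_t g)
{
    return (uint32_t)(((uint64_t)g + n / g) >> 1);
}

/* Newton step followed by the downward correction: integer Newton can
   land one above floor(sqrt(n)), for example when n = m*m - 1. */
uint32_t newton_step_corrected(uint32_t n, uint32_t g)
{
    uint32_t r = newton_step(n, g);
    return ((uint64_t)r * r > n) ? r - 1 : r;
}
```

With n = 99 (that is, 10*10 - 1) and a guess of 9, the step computes (9 + 99/9) >> 1 = (9 + 11) >> 1 = 10, which is one too high; the comparison 10*10 > 99 then pulls the result back to 9.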

If multiplication is the same speed (or faster than!) addition and shifting, or if you lack a fast shift-by-amount-contained-in-a-register instruction, then the following will not be helpful. Otherwise:
You're computing temp*temp afresh on each loop cycle, but temp = res | add, which is the same as res + add since their bits don't overlap, and (a) you have already computed res*res on a previous loop cycle, and (b) add has special structure (it's always just a single bit). So, using the fact that (a+b)^2 = a^2 + 2ab + b^2, and you already have a^2, and b^2 is just a single bit shifted twice as far to the left as the single-bit b, and 2ab is just a left-shifted by 1 more position than the location of the single bit in b, you can get rid of the multiplication:
unsigned short int int_sqrt32(unsigned int x)
{
    unsigned short int res = 0;
    unsigned int res2 = 0;              /* res * res */
    unsigned short int add = 0x8000;
    unsigned int add2 = 0x40000000;     /* add * add */
    int i;
    for (i = 0; i < 16; i++)
    {
        /* (res + add)^2 = res^2 + 2*res*add + add^2, with add == 1 << (15 - i) */
        unsigned int g2 = res2 + ((unsigned int)res << (16 - i)) + add2;
        if (x >= g2)
        {
            res |= add;
            res2 = g2;
        }
        add >>= 1;
        add2 >>= 2;
    }
    return res;
}
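The identity from the paragraph above can also be checked on its own. The helper below is a hypothetical illustration (the name next_square is not from the answer): given res and its square, it produces the square of res with one more bit set, using only shifts and adds.

```c
#include <stdint.h>

/* Square of (res | 1u << bit), given res2 == res * res.
   Assumes 'bit' and all lower bits are clear in res, and that the
   result fits in 32 bits. Uses (a+b)^2 = a^2 + 2ab + b^2 with b = 1 << bit:
   2ab is res shifted left by bit + 1, and b^2 is 1 shifted left by 2 * bit. */
uint32_t next_square(uint32_t res, uint32_t res2, unsigned bit)
{
    return res2 + (res << (bit + 1)) + (1u << (2 * bit));
}
```

For example, with res = 0xC000 (square 0x90000000) and bit = 13, the result is 0x90000000 + 0x30000000 + 0x04000000 = 0xC4000000, which is 0xE000 squared.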
Also I would guess that it's a better idea to use the same type (unsigned int) for all variables, since according to the C standard, operands narrower than int are promoted to int (or unsigned int) before any arithmetic operation is performed, and the result is converted back when it is stored in a narrower variable. (This may of course be optimised away by a sufficiently intelligent compiler, but why take the risk?)
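A concrete case where the promotion matters, as a minimal sketch assuming a platform with 16-bit short and 32-bit int: multiplying two unsigned short values promotes both to signed int, so the product of two large values can overflow, which is undefined behaviour. Casting one operand first keeps the multiplication unsigned. The function name square_u16 is made up for this illustration.

```c
#include <stdint.h>

/* Square a 16-bit value without signed overflow. Writing "a * a" alone
   would multiply two promoted ints, which overflows for a >= 46341
   when int is 32 bits. */
uint32_t square_u16(uint16_t a)
{
    return (uint32_t)a * a;   /* the cast makes the multiply unsigned 32-bit */
}
```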

From the comment trail, it seems that the RISC processor only provides 32x32->32 bit multiplication and 16x16->32 bit multiplication. A 32x32->64 bit widening multiply, or a MULHI instruction returning the upper 32 bits of a 64-bit product, is not provided.
This would seem to exclude approaches based on Newton-Raphson iteration, which would likely be inefficient, as they typically require either MULHI instruction or widening multiply for the intermediate fixed-point arithmetic.
The C99 code below uses a different iterative approach that requires only 16x16->32 bit multiplies, but converges somewhat linearly, requiring up to six iterations. This approach requires CLZ functionality to quickly determine a starting guess for the iterations. Asker stated in the comments that the RISC processor used does not provide CLZ functionality. So emulation of CLZ is required, and since the emulation adds to both storage and instruction count, this may make this approach uncompetitive. I performed a brute-force search to determine the deBruijn lookup table with the smallest multiplier.
This iterative algorithm delivers raw results quite close to the desired results, i.e. (int)sqrt(x), but always somewhat on the high side due to the truncating nature of integer arithmetic. To arrive at the final result, the result is conditionally decremented until the square of the result is less than or equal to the original argument.
The use of the volatile qualifier in the code only serves to establish that all named variables can in fact be allocated as 16-bit data without impacting the functionality. I do not know whether this provides any advantage, but noticed that the OP specifically used 16-bit variables in their code. For production code, volatile should be removed.
Note that for most processors, the correction steps at the end should not involve any branching. The product y*y can be subtracted from x with carry-out (or borrow-out), then y is corrected by a subtract with carry-in (or borrow-in). So each step should be a sequence MUL, SUBcc, SUBC.
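In C, one such branch-free correction step can be written by turning the comparison into a 0/1 borrow. This is a sketch of the idea only; the name correct_down is hypothetical, and whether a compiler actually emits a MUL, SUBcc, SUBC sequence for it depends on the target.

```c
#include <stdint.h>

/* Decrement y by one if y*y exceeds x, without an explicit branch. */
uint16_t correct_down(uint32_t x, uint16_t y)
{
    uint32_t borrow = x < (uint32_t)y * y;   /* 1 when y is too large, else 0 */
    return (uint16_t)(y - borrow);
}
```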
Because implementation of the iteration by a loop incurs substantial overhead, I have elected to completely unroll the loop, but provide two early-out checks. Tallying the operations manually I count 46 operations for the fastest case, 54 operations for the average case, and 60 operations for the worst case.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
static const uint8_t clz_tab[32] = {
31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0};
uint8_t clz (uint32_t a)
{
a |= a >> 16;
a |= a >> 8;
a |= a >> 4;
a |= a >> 2;
a |= a >> 1;
return clz_tab [0x07c4acdd * a >> 27];
}
/* 16 x 16 -> 32 bit unsigned multiplication; should be single instruction */
uint32_t umul16w (uint16_t a, uint16_t b)
{
return (uint32_t)a * b;
}
/* Reza Hashemian, "Square Rooting Algorithms for Integer and Floating-Point
Numbers", IEEE Transactions on Computers, Vol. 39, No. 8, Aug. 1990, p. 1025
*/
uint16_t isqrt (uint32_t x)
{
volatile uint16_t y, z, lsb, mpo, mmo, lz, t;
if (x == 0) return x; // early out, code below can't handle zero
lz = clz (x); // #leading zeros, 32-lz = #bits of argument
lsb = lz & 1;
mpo = 17 - (lz >> 1); // m+1, result has roughly half the #bits of argument
mmo = mpo - 2; // m-1
t = 1 << mmo; // power of two for two's complement of initial guess
y = t - (x >> (mpo - lsb)); // initial guess for sqrt
t = t + t; // power of two for two's complement of result
z = y;
y = (umul16w (y, y) >> mpo) + z;
y = (umul16w (y, y) >> mpo) + z;
if (x >= 0x40400) {
y = (umul16w (y, y) >> mpo) + z;
y = (umul16w (y, y) >> mpo) + z;
if (x >= 0x1002000) {
y = (umul16w (y, y) >> mpo) + z;
y = (umul16w (y, y) >> mpo) + z;
}
}
y = t - y; // raw result is 2's complement of iterated solution
y = y - umul16w (lsb, (umul16w (y, 19195) >> 16)); // mult. by sqrt(0.5)
if ((int32_t)(x - umul16w (y, y)) < 0) y--; // iteration may overestimate
if ((int32_t)(x - umul16w (y, y)) < 0) y--; // result, adjust downward if
if ((int32_t)(x - umul16w (y, y)) < 0) y--; // necessary
return y; // (int)sqrt(x)
}
int main (void)
{
uint32_t x = 0;
uint16_t res, ref;
do {
ref = (uint16_t)sqrt((double)x);
res = isqrt (x);
if (res != ref) {
printf ("!!!! x=%08x res=%08x ref=%08x\n", x, res, ref);
return EXIT_FAILURE;
}
x++;
} while (x);
return EXIT_SUCCESS;
}
Another possibility is to use the Newton iteration for the square root, despite the high cost of division. For small inputs only one iteration will be required. Although the asker did not state this, based on the execution time of 16 cycles for the DIV operation I strongly suspect that this is actually a 32/16->16 bit division which requires additional guard code to avoid overflow, defined as a quotient that does not fit into 16 bits. I have added appropriate safeguards to my code based on this assumption.
Since the Newton iteration doubles the number of good bits each time it is applied, we only need a low-precision initial guess which can easily be retrieved from a table based on the five leading bits of the argument. In order to grab these, we first normalize the argument into 2.30 fixed-point format with an additional implicit scale factor of 2^(32-(lz & ~1)), where lz is the number of leading zeros in the argument. As in the previous approach the iteration doesn't always deliver an accurate result, so a correction must be applied should the preliminary result be too big. I count 49 cycles for the fast path, 70 cycles for the slow path (average 60 cycles).
static const uint16_t sqrt_tab[32] =
{ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x85ff, 0x8cff, 0x94ff, 0x9aff, 0xa1ff, 0xa7ff, 0xadff, 0xb3ff,
0xb9ff, 0xbeff, 0xc4ff, 0xc9ff, 0xceff, 0xd3ff, 0xd8ff, 0xdcff,
0xe1ff, 0xe6ff, 0xeaff, 0xeeff, 0xf3ff, 0xf7ff, 0xfbff, 0xffff
};
/* 32/16->16 bit division. Note: Will overflow if x[31:16] >= y */
uint16_t udiv_32_16 (uint32_t x, uint16_t y)
{
uint16_t r = x / y;
return r;
}
/* table lookup for initial guess followed by division-based Newton iteration*/
uint16_t isqrt (uint32_t x)
{
volatile uint16_t q, lz, y, i, xh;
if (x == 0) return x; // early out, code below can't handle zero
// initial guess based on leading 5 bits of argument normalized to 2.30
lz = clz (x);
i = ((x << (lz & ~1)) >> 27);
y = sqrt_tab[i] >> (lz >> 1);
xh = (x >> 16); // needed for overflow check on division
// first Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
if (lz < 10) {
// second Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
}
if (umul16w (y, y) > x) y--; // adjust quotient if too large
return y; // (int)sqrt(x)
}
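The table index computation used above can be examined separately. The sketch below is an illustration with assumed names (clz32, sqrt_tab_index) and a simple loop standing in for the table-based clz. Because the argument is shifted left by an even count, at least one of its top two bits is set afterwards, so the five-bit index always lands in 8..31; that is consistent with the first eight entries of sqrt_tab being unused zeros.

```c
#include <stdint.h>

/* Portable leading-zero count for non-zero a (stand-in for clz above). */
static unsigned clz32(uint32_t a)
{
    unsigned n = 0;
    while (!(a & 0x80000000u)) {
        a <<= 1;
        n++;
    }
    return n;
}

/* Index into a 32-entry table from the five leading bits of x after
   normalising by an even shift. x must be non-zero. */
unsigned sqrt_tab_index(uint32_t x)
{
    unsigned lz = clz32(x);
    return (x << (lz & ~1u)) >> 27;
}
```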

I don't know how to turn it into an efficient algorithm but when I investigated this in the '80s an interesting pattern emerged. When rounding square roots, there are two more integers with that square root than the preceding one (after zero).
So, one number (zero) has a square root of zero, two have a square root of 1 (1 and 2), 4 have a square root of two (3, 4, 5 and 6) and so on. Probably not a useful answer but interesting nonetheless.
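The pattern can be turned into a (slow) algorithm directly: the group of integers whose rounded square root is k has 2k members and ends at k*(k+1), so stepping through the group boundaries finds the rounded root. A minimal sketch, with a made-up function name; it needs on the order of sqrt(n) iterations, so it illustrates the pattern rather than competing with the other answers.

```c
#include <stdint.h>

/* Square root of n rounded to the nearest integer, found by walking
   the group boundaries 0, 2, 6, 12, ... (that is, k*(k+1)). */
uint32_t rounded_sqrt_by_counting(uint32_t n)
{
    uint32_t k = 0;
    uint64_t upper = 0;        /* largest n whose rounded root is k */

    while (n > upper) {
        k++;
        upper += 2u * k;       /* the next group holds 2*k integers */
    }
    return k;
}
```

Note that this computes the rounded root, not the truncated one: for example 3 maps to 2, because sqrt(3) is about 1.73.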

Here is a less incremental version of the technique @j_random_hacker described. On at least one processor it was just a bit faster when I fiddled with this a couple of years ago. I have no idea why.
// assumes unsigned is 32 bits
unsigned isqrt1(unsigned x) {
    unsigned r = 0, r2 = 0;
    for (int p = 15; p >= 0; --p) {
        unsigned tr2 = r2 + (r << (p + 1)) + (1u << (p + p));
        if (tr2 <= x) {
            r2 = tr2;
            r |= (1u << p);
        }
    }
    return r;
}
/*
gcc 6.3 -O2
isqrt(unsigned int):
mov esi, 15
xor r9d, r9d
xor eax, eax
mov r8d, 1
.L3:
lea ecx, [rsi+1]
mov edx, eax
mov r10d, r8d
sal edx, cl
lea ecx, [rsi+rsi]
sal r10d, cl
add edx, r10d
add edx, r9d
cmp edx, edi
ja .L2
mov r11d, r8d
mov ecx, esi
mov r9d, edx
sal r11d, cl
or eax, r11d
.L2:
sub esi, 1
cmp esi, -1
jne .L3
rep ret
*/
If you turn up gcc 9 x86 optimization, it completely unrolls the loop and folds constants. The result is still only about 100 instructions.

Find the INDEX of element having max. absolute value using AVX512 instructions

I'm new to coding with AVX512 instructions. My machine is an Intel KNL 7250. I am trying to use AVX512 instructions to find the INDEX of the element with the maximum absolute value in an array of doubles whose size is a multiple of 8 (size % 8 == 0). But it prints an output index = 0 every time. I don't know where the problem is; please help me.
Also, how to use printf for __m512i type?
Thanks.
Code:
void main()
{
int i;
int N=160;
double vec[N];
for(i=0;i<N;i++)
{
vec[i]=(double)(-i) ;
if(i==10)
{
vec[i] = -1127;
}
}
int max = avxmax_(N, vec);
printf("maxindex=%d\n", max);
}
int avxmax_(int N, double *X )
{
// return the index of element having maximum absolute value.
int maxindex, ix, i, M;
register __m512i increment, indices, maxindices, maxvalues, absmax, max_values_tmp, abs_max_tmp, tmp;
register __mmask8 gt;
double values_X[8];
double indices_X[8];
double maxvalue;
maxindex = 1;
if( N == 1) return(maxindex);
M = N % 8;
if( M == 0)
{
increment = _mm512_set1_epi64(8); // [8,8,8,8,8,8,8,8]
indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
maxindices = indices;
maxvalues = _mm512_loadu_si512(&X[0]);
absmax = _mm512_abs_epi64(maxvalues);
for( i = 8; i < N; i += 8)
{
// advance scalar indices: indices[0] + 8, indices[1] + 8,...,indices[7] + 8
indices = _mm512_add_epi64(indices, increment);
// compare
max_values_tmp = _mm512_loadu_si512(&X[i]);
abs_max_tmp = _mm512_abs_epi64(max_values_tmp);
gt = _mm512_cmpgt_epi64_mask(abs_max_tmp, absmax);
// update
maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
absmax = _mm512_max_epi64(absmax, abs_max_tmp);
}
// scalar part
_mm512_storeu_si512((__m512i*)values_X, absmax);
_mm512_storeu_si512((__m512i*)indices_X, maxindices);
maxindex = indices_X[0];
maxvalue = values_X[0];
for(i = 1; i < 8; i++)
{
if(values_X[i] > maxvalue)
{
maxvalue = values_X[i];
maxindex = indices_X[i];
}
}
return(maxindex);
}
}

Your function returns 0 because you're treating the int64 index as the bit-pattern for a double, and converting that (tiny) number to an integer. double indices_X[8]; is the bug; should be uint64_t. There are other bugs, see below.
This bug is easier to spot if you declare variables as you use them, C99 style, not obsolete C89 style.
You _mm512_storeu_si512 the vector of int64_t indices into double indices_X[8], type-punning it to double, then in plain C do int maxindex = indices_X[0];. This is implicit type-conversion, converting that subnormal double to an integer.
(I noticed a mysterious vcvttsd2si FP->int conversion in the asm https://godbolt.org/z/zsfc36 while converting the code to C99 style variable declarations next to initializers. That was a clue: there should be no FP->int conversion in this function. I noticed that around the same time I was moving the double indices_X[8]; declaration down into the block that uses it, and noticing it had type double.)
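The effect of that type-punning can be reproduced in scalar code. The sketch below is an illustration with a made-up function name; it assumes IEEE 754 binary64 doubles. Copying the bytes of a small 64-bit integer into a double yields a subnormal number far below 1, and converting that double to int truncates it to 0.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a 64-bit index as a double, then convert the
   double to int, as happens when indices are stored into double indices_X[]. */
int punned_index_to_int(uint64_t idx)
{
    double d;
    memcpy(&d, &idx, sizeof d);   /* same bytes, now viewed as a double */
    return (int)d;                /* a tiny subnormal value truncates to 0 */
}
```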
It is actually possible to use integer operations on FP bit-patterns
But only if you use the right ones! IEEE754 exponent biases mean that the encoding / bit-pattern can be compared as a sign/magnitude integer. So you can do abs / min / max and compare on it, but not of course integer add / sub (unless you're implementing nextafter).
_mm512_abs_epi64 is 2's complement abs, not sign-magnitude. Instead, you must just mask off the sign bit. Then you're all set to treat the result as an unsigned integer or signed-2's-complement. (Either works because the high bit is clear.)
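A scalar sketch of the same idea, assuming IEEE 754 binary64 doubles and made-up helper names: clearing the sign bit of the bit pattern gives the absolute value, and for non-NaN inputs the resulting patterns compare as unsigned integers in the same order as the magnitudes they encode.

```c
#include <stdint.h>
#include <string.h>

/* Bit pattern of |d|: the double's bits with the sign bit cleared. */
uint64_t fp_abs_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u & 0x7FFFFFFFFFFFFFFFull;
}

/* Non-zero when |a| > |b|, decided with an integer compare only. */
int abs_greater(double a, double b)
{
    return fp_abs_bits(a) > fp_abs_bits(b);
}
```

With the test data from the question, abs_greater(-1127.0, -159.0) is true, so the element at index 10 wins once the absolute value is taken this way.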
Using integer max has the interesting property that NaNs will compare higher than any other value, Inf below that, then finite values. So we get a NaN-propagating max-finder basically for free.
On KNL (Knights Landing), FP vmaxpd and vcmppd have the same performance as their integer equivalents: 2 cycle latency, 0.5c throughput (https://agner.org/optimize/). So your way has zero advantage on KNL, but it's a neat trick for mainstream Intel, like Skylake-X and IceLake.
Bugfixed optimized version:
Use size_t for return type and loop counters / indices to handle potentially huge arrays, instead of a random mix of int and 64-bit vector elements. (uint64_t for the temp array that collects the horizontal-max: it's always 64-bit even in a build with 32-bit pointers / size_t.)
bugfix: return 0 on N==1, not 1: the index of the only element is 0.
bugfix: return -1 on N%8 != 0, instead of falling off the end of the non-void function. (Undefined behaviour if the caller uses the result in C, or as soon as execution falls off the end in C++).
bugfix: abs of an FP value = clear the sign bit, not 2's complement abs on the bit-pattern
sort of bugfix: use unsigned integer compare and max, so it would work for 2's complement integers with _mm512_abs_epi64 (which produces an unsigned result; remember that -LONG_MIN overflows to LONG_MIN if you keep treating it as signed).
style improvement: if (N%8 != 0) return -1; instead of putting most of the body in an if block.
style improvement: declare vars when they're first used, and removed some unused ones that were pure noise. This is idiomatic for C since C99, which was standardized over 20 years ago.
style improvement: use simpler names for tmp vector vars that just hold a load result. Sometimes you just need a tmp var because intrinsic names are so long that you don't want to type _mm...load... as an arg for another intrinsics. A name like v scoped to the inner loop is a clear sign it's just a placeholder, not used later. (This style works best when you're declaring it as you init it, so it's easy to see it can't be used in an outer scope.)
optimization: reduce 8 -> 4 elements after the loop with SIMD: extract the high half, combine with existing low half. (Same as you would for a simpler horizontal reduction like sum or max). Inconvenient when we need instructions that only AVX512 has, but KNL doesn't have AVX512VL, so we have to use the 512-bit version and ignore the high garbage. But KNL does have AVX1 / AVX2 so we can still store 256-bit vectors and do some things.
Using a merge-masking _mm512_mask_extracti64x4_epi64 extract to blend the high half directly into the low half of the same vector is a cool trick which compilers don't find if you use a 512-bit mask-blend. :P
sort of bugfix: in C, main has a return type of int in hosted implementations (running under an OS).
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
// bugfix: indices can be larger than an int
size_t avxmax_(size_t N, double *X )
{
// return the index of element having maximum absolute value.
if( N == 1)
return 0; // bugfix: 0 is the only valid element in this case, not 1
if( N % 8 != 0) // [[unlikely]] // C++20
return -1; // bugfix: don't fall off the end of the function in this case
const __m512i fp_absmask = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFF);
__m512i indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
__m512i maxindices = indices;
__m512i v = _mm512_loadu_si512(&X[0]);
__m512i absmax = _mm512_and_si512(v, fp_absmask);
for(size_t i = 8; i < N; i += 8) // [[likely]] // C++20
{
// advance indices by 8 each.
indices = _mm512_add_epi64(indices, _mm512_set1_epi64(8));
// compare
v = _mm512_loadu_si512(&X[i]);
__m512i vabs = _mm512_and_si512(v, fp_absmask);
// vabs = _mm512_abs_epi64(v); // for actual integers, not FP bit patterns
__mmask8 gt = _mm512_cmpgt_epu64_mask(vabs, absmax);
// update
maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
absmax = _mm512_max_epu64(absmax, vabs);
}
// reduce 512->256; KNL doesn't have AVX512VL so some ops require 512-bit vectors
__m256i absmax_hi = _mm512_extracti64x4_epi64(absmax, 1);
__m512i absmax_hi512 = _mm512_castsi256_si512(absmax_hi); // free
__mmask8 gt = _mm512_cmpgt_epu64_mask(absmax_hi512, absmax);
__m256i abs256 = _mm512_castsi512_si256(_mm512_max_epu64(absmax_hi512, absmax)); // reduced to low 4 elements
// extract with merge-masking = blend
__m256i maxindices256 = _mm512_mask_extracti64x4_epi64(
_mm512_castsi512_si256(maxindices), gt, maxindices, 1);
// scalar part
double values_X[4];
uint64_t indices_X[4];
_mm256_storeu_si256((__m256i*)values_X, abs256);
_mm256_storeu_si256((__m256i*)indices_X, maxindices256);
size_t maxindex = indices_X[0];
double maxvalue = values_X[0];
for(int i = 1; i < 4; i++)
{
if(values_X[i] > maxvalue)
{
maxvalue = values_X[i];
maxindex = indices_X[i];
}
}
return maxindex;
}
On Godbolt: the main loop from GCC10.2 -O3 -march=knl is 8 instructions. So even if (best case) KNL could decode and run it at 2/clock, it's still taking 4 cycles per vector. You can run the program on Godbolt; it runs on Skylake-X servers so it can run AVX512 code. You can see it prints 10.
.L4:
vpandd zmm2, zmm5, ZMMWORD PTR [rsi+rax*8] # load, folded into AND
add rax, 8
vpcmpuq k1, zmm2, zmm0, 6
vpaddq zmm1, zmm1, zmm4 # increment current indices
cmp rdi, rax
vmovdqa64 zmm3{k1}, zmm1 # blend maxidx using merge-masking
vpmaxuq zmm0, zmm0, zmm2
ja .L4
vmovapd zmm1, zmm3 # silly missed optimization related to the case where the loop runs 0 times.
.L3:
vextracti64x4 ymm2, zmm0, 0x1 # high half of absmax
vpcmpuq k1, zmm2, zmm0, 6 # compare high and low
vpmaxuq zmm0, zmm0, zmm2
# vunpckhpd xmm2, xmm0, xmm0 # setting up for unrolled scalar loop
vextracti64x4 ymm1{k1}, zmm3, 0x1 # masked extract of indices
Another option for the loop would be a masked vpbroadcastq zmm3{k1}, rax, adding the [0..7] per-element offsets after the loop. That would actually save the vpaddq in the loop, and we have the right i in a register if GCC is going to use an indexed addressing-mode anyway. (That's not good on Skylake-X; defeats micro-fusion of the memory-source vpandd.)
Agner Fog doesn't list performance for GP->vector broadcasts, but hopefully it's only single-uop on KNL at least. (And https://uops.info/ doesn't have KNL or KNM results).
Branchy strategy: when a new max is very rare
If you expect finding a new max to be very rare (e.g. array is large and uniformly distributed, or at least not trending upwards), it could be even faster to broadcast the current max and branch on finding any greater vector element.
Finding a new max means branching out of the loop (which probably mispredicts, so that's slow) and broadcasting that element (probably with a tzcnt to find the element index, then a broadcast-load, and update the index).
Especially with KNL's 4-way SMT to hide branch miss costs, this could be an overall throughput win for large arrays; fewer instructions per element on average.
But probably significantly worse for inputs that do trend upwards, so a new max is found O(n) times on average, not sqrt(n) or log(n) or whatever frequency a uniform distribution would give us.
PS: to print vectors, store them to an array and read back the elements (see the Q&A "print a __m128i variable").
Or use a debugger to show you their elements.

How to optimize C code : looking for the next set bit and finding sum of corresponding array elements

EDIT: Now I realize I didn't explain my algorithm well enough. I'll try again.
What I'm doing is something very similar to dot product of two vectors, but there is a difference. I've got two vectors: one vector of bits and one vector of floats of the same length. So I need to calculate sum:
float[0]*bit[0]+float[1]*bit[1]+..+float[N-1]*bit[N-1], BUT the difference from a classic dot product is that I need to skip some fixed number of elements after each set bit.
Example:
vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits = {1, 0, 1, 0, 1 }
nSkip = 2
in this case sum is calculated as follows:
sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
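As a baseline to check optimized versions against, the worked example above can be written as a plain scalar loop. This is only a reference sketch: the name skip_sum_ref and the one-byte-per-flag layout are illustrative assumptions (the real bit array is packed into 64-bit words), and it follows the example literally by advancing 1 + n_skip positions after a set bit.

```c
#include <stddef.h>

/* Reference (unoptimized) skip-sum, following the worked example:
   bits[] holds one 0/1 flag per element, and n_skip elements are
   skipped after every set bit. Names and layout are illustrative. */
static float skip_sum_ref(const float *floats, const unsigned char *bits,
                          size_t n, size_t n_skip)
{
    float sum = 0.0f;
    size_t i = 0;
    while (i < n) {
        if (bits[i]) {
            sum += floats[i];
            i += 1 + n_skip;   /* consume this element plus the skipped ones */
        } else {
            i += 1;            /* clear bit: contributes nothing, no skipping */
        }
    }
    return sum;
}
```

Called with the five floats and bits from the example and n_skip = 2, it visits positions 0, 3 and 4.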
The following code is called many times with different data so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some extent but still slow, but I don't know how to further improve it.
Bit array is implemented as an array of 64 bit unsigned ints, it allows me to use bitscanforward to find the next set bit.
code:
unsigned int i = 0;
float fSum = 0;
do
{
unsigned int nAddr = i / 64;
unsigned int nShift = i & 63;
unsigned __int64 v = bitarray[nAddr] >> nShift;
unsigned long idx;
if (!_BitScanForward64(&idx, v))
{
i+=64-nShift;
continue;
}
i+= idx;
fSum += floatarray[i];
i+= nSkip;
} while(i<nEnd);
Profiler shows 3 slowest hotspots :
1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v)
3. fSum += floatarray[i]; (memory access)
But there is probably a different way of doing this. I was thinking about just resetting nSkip bits after each set bit in the bit vector and then calculating a classical dot product. I haven't tried it yet, but honestly I don't believe it will be faster with more memory access.

You have too many of your operations inside of the loop. You also only have one loop, so many of the operations that do need to happen for each flag word (the 64 bit unsigned integer) are happening 63 extra times.
Consider division an expensive operation and try to not do that too often when optimizing code for performance.
Memory access is also considered expensive in terms of how long it takes, so this should also be limited to required accesses only.
Tests that allow you to exit early are often useful (though sometimes the test itself is expensive relative to the operations you'd be avoiding, but that's probably not the case here).
Using nested loops should simplify this a lot. The outer loop should work at the 64 bit word level, and the inner loop should work at the bit level.
I have noticed a mistake in my earlier recommendations. Since the division here is by 64, which is a power of 2, this is not actually an expensive operation, but we still need to get as many operations as far out of the loops as we can.
/* this is completely untested, but incorporates the optimizations
that I outlined as well as a few others.
I process the arrays backwards, which allows for elimination of
comparisons of variables against other variables, which is much
slower than comparisons of variables against 0, which is essentially
free on many processors when you have just operated or loaded the
value to a register.
Going backwards at the bit level also allows for the possibility that
the compiler will take advantage of the comparison of the top bit
being the same as test for negative, which is cheap and mostly free
for all but the first time through the inner loop (for each time
through the outer loop).
*/
double acc = 0.0;
unsigned i_end;
unsigned i_bit_end;
int i_word_end;
if (nEnd == 0)
{
    return acc;   // nothing to sum
}
i_end = nEnd - 1;
i_bit_end = i_end % 64;
i_word_end = i_end / 64;
// Note: this sums every set bit; the nSkip step from the question
// is not handled here.
do
{
    // Shift the last valid bit of this word up into the top position.
    unsigned __int64 v = bitarray[i_word_end] << (63 - i_bit_end);
    unsigned i_upper = (unsigned)i_word_end << 6;   // i_word_end * 64
    while (v)
    {
        if (v & 0x8000000000000000ULL)
        {
            // The following code is semantically the same as
            // unsigned i = i_bit_end + (i_word_end * 64);
            unsigned i = i_bit_end | i_upper;
            acc += floatarray[i];
        }
        v <<= 1;
        i_bit_end--;
    }
    i_bit_end = 63;
    i_word_end--;
} while (i_word_end >= 0);

I think you should check "how to ask questions" first. You will not gain many upvotes for this, since you are asking us to do the work for you instead of introducing a particular problem.
I cannot see why you are incrementing the same variable in two places instead of one (i).
I also think you should declare variables only once, not in every iteration.

Removing slow int64 division from fixed point atan2() approximation

I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken entirely by a simple 64-bit integer division! And sadly none of the terms of that division are constant. Can I speed up the division itself? Is there any way I can remove it?
I think I need this division because, since I approximate atan2(y, x) with a 1D lookup table, I need to normalise the distance of the point represented by x,y to something like a unit circle or unit square (I chose a unit 'diamond', which is a unit square rotated by 45°, which gives a pretty even precision across the positive quadrant). So the division finds (|y|-|x|) / (|y|+|x|). Note that the divisor is in 32 bits while the numerator is a 32-bit number shifted 29 bits left so that the result of the division has 29 fractional bits. Also, using floating point division is not an option as this function is required not to use floating point arithmetic.
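To make the normalisation concrete, here is the division step in isolation as a small sketch. The helper name diamond_pos_u1_29 is hypothetical, it takes the already-absolute values, and it assumes ya + xa is positive and fits in an int32_t; it multiplies by 2^29 instead of shifting so that a negative numerator stays well defined in C.

```c
#include <stdint.h>

/* Hypothetical helper: position along the unit diamond in u1.29,
   p = 0.5 * (1 - (ya - xa) / (ya + xa)),
   for non-negative ya, xa with ya + xa > 0 (and no int32 overflow). */
static uint32_t diamond_pos_u1_29(int32_t ya, int32_t xa)
{
    const int ds = 29;                       /* division shift: 29 fractional bits */
    int32_t d = ya + xa;
    /* (ya - xa) scaled by 2^29, then normalised by d; result is in [-2^29, 2^29] */
    int32_t div = (int32_t)(((int64_t)(ya - xa) * ((int64_t)1 << ds)) / d);
    return (uint32_t)(((int32_t)1 << ds) - div) >> 1;
}
```

For example, ya = 1 and xa = 3 give (1-3)/(1+3) = -0.5, so p = 0.5 * 1.5 = 0.75, which is 3 * 2^27 in u1.29.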
Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:
int32_t fpatan2(int32_t y, int32_t x) // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
#include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
const uint32_t outfmt = 32; // final output format in s0.outfmt
const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp); // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
const uint32_t halfof = 1UL<<(outfmt-1); // represents 0.5 in the output format, which since it is in turns means half a circle
const uint32_t pds=ds-lutsp; // division shift and post-division shift
uint32_t lutind, p, t, d;
int32_t a, b, xa, ya, xs, ys, div, r;
xs = x >> 31; // equivalent of fabs()
xa = (x^xs) - xs;
ys = y >> 31;
ya = (y^ys) - ys;
d = ya+xa;
if (d==0) // if both y and x are 0 then they add up to 0 and we must return 0
return 0;
// the following does 0.5 * (1. - (y-x) / (y+x))
// (y+x) is u1.31, (y-x) is s0.31, div is in s1.29
div = ((int64_t) (ya-xa)<<ds) / d; // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
p = ((1UL<<ds) - div) >> 1; // before shift the format is s2.29. position in u1.29
lutind = p >> ish; // index for the LUT
t = (p & tmask) >> tbd; // interpolator between two LUT entries
a = fpatan_lut[lutind];
b = fpatan_lut[lutind+1];
r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip); // linear interpolation of a and b by t in s0.32 format
// Quadrants
if (xs) // if x was negative
r = halfof - r; // r = 0.5 - r
r = (r^ys) - ys; // if y was negative then r is negated
return r;
}

Unfortunately, a latency of 70 cycles is typical for a 64-bit integer division on x86 CPUs. Floating point division typically has about half the latency or less. The increased cost comes from the fact that modern CPUs only have dividers in their floating point execution units (they're very expensive in terms of silicon area), so they need to convert the integers to floating point and back again. So just substituting a floating point division in place of the integer one isn't likely to help. You'll need to refactor your code to use floating point throughout to take advantage of the faster floating point division.
If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.

Based on @Iwillnotexist Idonotexist's suggestion to use lzcnt, a reciprocal and a multiplication, I implemented a division function that runs in about 23.3 cycles with a precision of about 1 part in 19 million using a 1.5 kB LUT. For example, in one of the worst cases, 1428769848 / 1080138864, you might get 1.3227648959 instead of 1.3227649663.
I found an interesting technique while researching this. I was struggling to think of something fast and precise enough, as not even a quadratic approximation of 1/x in [0.5 , 1.0) combined with an interpolated difference LUT would do. Then I had the idea of doing it the other way around: I made a lookup table that contains the quadratic coefficients that fit the curve on a short segment representing 1/128th of the [0.5 , 1.0) range, which gives a very small error. Using the 7 most significant bits of what represents x in the [0.5 , 1.0) range as a LUT index, I directly get the coefficients that work best for the segment that x falls into.
Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:
#include "ffo_lut.h"
static INLINE int32_t log2_ffo32(uint32_t x) // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
int32_t y;
y = x>>21; if (y) return ffo_lut[y]+21;
y = x>>10; if (y) return ffo_lut[y]+10;
return ffo_lut[x];
}
// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0. Division by 0 causes a crash by negative index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt) // ~23.3 cycles, max error (by division) 53.39e-9
{
#include "fpdiv.h" // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
const int32_t *c;
int32_t xa, xs, p, sh;
uint32_t expon, frx, lutind;
const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1); // the shift for the index, the shift for 31-bit xa, the value of 0.5
int64_t out;
int64_t c0, c1, c2;
// turn x into xa (|x|) and sign of x (xs)
xs = x >> 31;
xa = (x^xs) - xs;
// decompose |x| into frx * 2^expon
expon = log2_ffo32(xa);
frx = (xa << (31-expon)) >> cfs; // the fractional part is now in 0.27 format
// lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
lutind = (frx - half) >> ish; // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
lutind *= 3; // 3 entries for each index
c = &fpdiv_lut[lutind]; // c points to the correct c0, c1, c2
c0 = c[0]; c1 = c[1]; c2 = c[2];
p = (int64_t) frx * frx >> prec; // x^2
p = c2 * p >> prec; // c2 * x^2
p += c1 * frx >> prec; // + c1 * x
p += c0; // + c0, p = (1.0 , 2.0] in 2.27 format
// apply the necessary bit shifts and reapplies the original sign of x to make final result
sh = expon + prec - outfmt; // calculates the final needed shift
out = (int64_t) y * p; // format is s31 + 1.27 = s32.27
if (sh >= 0)
out >>= sh;
else
out <<= -sh;
out = (out^xs) - xs; // if x was negative then out is negated
return out;
}
I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.
As for the fpatan2() question the solution would be to replace this line:
div = ((int64_t) (ya-xa)<<ds) / d;
with that line:
div = fpdiv(ya-xa, d, ds);

Your time-hog instruction:
div = ((int64_t) (ya-xa)<<ds) / d;
exposes at least two issues. The first is that you shadow the standard library's div function; this is minor and might never be noticed. The second is that, according to C language rules, both operands are first converted to a common type, which is int64_t, and then the division for this type is expanded into a CPU instruction that divides a 128-bit dividend by a 64-bit divisor(!) Here is an extract from the assembly of a cut-down version of your function:
21: 48 89 c2 mov %rax,%rdx
24: 48 c1 fa 3f sar $0x3f,%rdx ## this is sign bit extension
28: 48 f7 fe idiv %rsi
Yep, this division requires about 70 cycles and can't be optimized (well, really it can, but e.g. reverse divisor approach requires multiplication with 192-bit product). But if you are sure this division can be done with 64-bit dividend and 32-bit divisor and it won't overflow (quotient will fit into 32 bits) (I agree because ya-xa is always less by absolute value than ya+xa), this can be sped up using explicit assembly request:
uint64_t tmp_num = ((int64_t) (ya-xa))<<ds;
asm("idivl %[d]" :
[a] "=a" (div1) :
"[a]" (tmp_num), "d" (tmp_num >> 32), [d] "q" (d) :
"cc");
This is quick and dirty and should be carefully verified, but I hope the idea is clear. The resulting assembly now looks like:
18: 48 98 cltq
1a: 48 c1 e0 1d shl $0x1d,%rax
1e: 48 89 c2 mov %rax,%rdx
21: 48 c1 ea 20 shr $0x20,%rdx
27: f7 f9 idiv %ecx
This seems to be a huge improvement, because a 64/32 division requires up to 25 clock cycles on the Core family, according to the Intel optimization manual, instead of the 70 you see for the 128/64 division.
More minor improvements can be added; e.g. the shifts can be done more economically in parallel:
uint32_t diff = ya - xa;
uint32_t lowpart = diff << 29;
uint32_t highpart = diff >> 3;
asm("idivl %[d]" :
[a] "=a" (div1) :
"[a]" (lowpart), "d" (highpart), [d] "q" (d) :
"cc");
which results in:
18: 89 d0 mov %edx,%eax
1a: c1 e0 1d shl $0x1d,%eax
1d: c1 ea 03 shr $0x3,%edx
22: f7 f9 idiv %ecx
but this is a minor fix compared to the division-related one.
To conclude, I really doubt this routine is worth implementing in C. The language is quite uneconomical for integer arithmetic, requiring useless widening and losing the high parts of results. The whole routine is worth moving to assembler.

Given an fpatan() implementation, you could simply implement fpatan2() in terms of that.
Assuming constants defined for pi and pi/2:
int32_t fpatan2( int32_t y, int32_t x)
{
fixed theta ;
if( x == 0 )
{
theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
}
else
{
theta = fpatan( y / x ) ;
if( x < 0 )
{
theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
}
}
return theta ;
}
Note that fixed library implementations are easy to get very wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library under discussion makes the code much simpler, in most cases you can just replace the float or double keyword with fixed. It does not however have an atan2() implementation, the code above is adapted from my implementation for that library.

Find which power of 2 range a number falls within? (In C)

As in whether it falls within 2^3 - 2^4, 2^4 - 2^5, etc. The number returned would be the EXPONENT itself (minus an offset).
How could this be done as quickly and efficiently as possible? This function will be called a lot in a program that is EXTREMELY dependent on speed. This is my current code, but it is far too inefficient as it uses a for loop.
static inline size_t getIndex(size_t numOfBytes)
{
int i = 3;
for (; i < 32; i++)
{
if (numOfBytes < (1 << i))
return i - OFFSET;
}
return (NUM_OF_BUCKETS - 1);
}
Thank you very much!

What you're after is simply log2(n), as far as I can tell.
It might be worth cheating and using some inline assembly if your target architecture(s) have instructions that can do this. See the Wikipedia entry on "find first set" for lots of discussion and information about hardware support.

One way to do it would be to find the highest order bit that is set to 1. I'm trying to think if this is efficient though, since you'll still have to do n checks in worst case.
Maybe you could do a binary search style where you check if it's greater than 2^16, if so, check if it's greater than 2^24 (assuming 32 bits here), and if not, then check if it's greater than 2^20, etc... That would be log(n) checks, but I'm not sure of the efficiency of a bit check vs a full int comparison.
Could get some perf data on either.
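The binary-search idea above could look roughly like this for 32-bit values. It is a sketch, not a measured implementation: floor_log2_bsearch is an illustrative name, and it narrows the position of the highest set bit with five comparisons instead of testing up to 32 bits one by one.

```c
#include <stdint.h>

/* floor(log2(v)) for v > 0 by binary search over the bit positions.
   Each step asks "is the highest set bit in the upper half of what is
   left?" and, if so, shifts that half down and records the offset. */
static int floor_log2_bsearch(uint32_t v)
{
    int r = 0;
    if (v >= (UINT32_C(1) << 16)) { v >>= 16; r += 16; }
    if (v >= (UINT32_C(1) << 8))  { v >>= 8;  r += 8;  }
    if (v >= (UINT32_C(1) << 4))  { v >>= 4;  r += 4;  }
    if (v >= (UINT32_C(1) << 2))  { v >>= 2;  r += 2;  }
    if (v >= (UINT32_C(1) << 1))  {           r += 1;  }
    return r;
}
```

Whether this beats the loop depends on how predictable the branches are for the actual inputs, so it is worth profiling both.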

There is a particularly efficient algorithm using de Bruijn sequences described on Sean Eron Anderson's excellent Bit Twiddling Hacks page:
uint32_t v; // find the log base 2 of 32-bit v
int r; // result goes here
static const int MultiplyDeBruijnBitPosition[32] =
{
0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};
v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;
r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
It works in 13 operations without branching!

You are basically trying to compute: floor(log2(x))
Take the logarithm to the base 2, then take the floor.
The most portable way to do this in C is to use the logf() function, which finds the log to the base e, then adjust: log2(x) == logf(x) / logf(2.0)
See the answer here: How to write log base(2) in c/c++
If you just cast the resulting float value to int, you compute floor() at the same time.
But, if it is available to you and you can use it, there is an extremely fast way to compute log2() of a floating point number: logbf()
From the man page:
The integer constant FLT_RADIX, defined in <float.h>, indicates the radix used
for the system's floating-point representation. If FLT_RADIX is 2,
logb(x) is equal to floor(log2(x)), except that it is probably faster.
http://linux.die.net/man/3/logb
If you think about how floating-point numbers are stored, you realize that the value floor(log2(x)) is part of the number, and if you just extract that value you are done. A little bit of shifting and bit-masking, then subtract the bias from the stored exponent field, and there you have it: the fastest way possible to compute floor(log2(x)) for any float value x.
http://en.wikipedia.org/wiki/Single_precision
But actually logbf() converts the result to a float before giving it to you, and handles errors. If you write your own function to extract the exponent as an integer, it will be slightly faster and an integer is what you want anyway. If you wanted to write your own function you need to use a C union to gain access to the bits inside the float; trying to play with pointers will get you warnings or errors related to "type-punning", at least on GCC. I will give details on how to do this, if you ask. I have written this code before, as an inline function.
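A minimal sketch of such an exponent-extraction function follows. It assumes float is an IEEE 754 binary32 value and that x is positive and normal (not zero, subnormal, infinite or NaN); the name float_exponent is illustrative, not from any library.

```c
#include <stdint.h>

/* floor(log2(x)) read straight out of the exponent field.
   Assumes IEEE 754 binary32 floats and a positive, normal x. */
static int float_exponent(float x)
{
    union { float f; uint32_t u; } pun;   /* union access avoids pointer type-punning */
    pun.f = x;
    /* bits 23..30 hold the biased exponent; the bias is 127 */
    return (int)((pun.u >> 23) & 0xFF) - 127;
}
```

If the input starts out as an integer, keep in mind that converting it to float rounds values above 2^24, which can push a number just below a power of two up into the next range.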
If you only have a small range of numbers to test, you could possibly cast your numbers to integer and then use a lookup table.

You can make use of floating number representation:
double n_bytes = numOfBytes;
Taking the exponent bits should give you the result as floating numbers are represented as:
(-1)^S X (1. + M) X 2^E
Where:
S - Sign
M - Mantissa
E - Exponent
To construct the mask and shift you would have to read about the exact bit pattern of the floating point type you are using.
The CPU floating point support does most of the work for you.
An even better way would be to use the built-in function:
double frexp (double x, int * exp );
Floating point representation

#include <limits.h> // For CHAR_BIT.
#include <math.h> // For frexp.
#include <stdio.h> // For printing results, as a demonstration.
// These routines assume 0 < x.
/* This requires GCC (or any other compiler that supplies __builtin_clz). It
should perform well on any machine with a count-leading-zeroes instruction
or something similar.
*/
static int log2A(unsigned int x)
{
return sizeof x * CHAR_BIT - 1 - __builtin_clz(x);
}
/* This requires that a double be able to exactly represent any unsigned int.
(This is true for 32-bit integers and 64-bit IEEE 754 floating-point.) It
might perform well on some machines and poorly on others.
*/
static int log2B(unsigned int x)
{
int exponent;
frexp(x, &exponent);
return exponent - 1;
}
int main(void)
{
// Demonstrate the routines.
for (unsigned int x = 1; x; x <<= 1)
printf("0x%08x: log2A -> %2d, log2B -> %2d.\n", x, log2A(x), log2B(x));
return 0;
}

This is generally fast on any machine with hardware floating point unit:
((union { float val; uint32_t repr; }){ x }.repr >> 23) - 0x7f
The only assumptions it makes are that floating point is IEEE and integer and floating point endianness match, both of which are true on basically all real-world systems (certainly all modern ones).
Edit: When I've used this in the past, I didn't need it for large numbers. Eric points out that it will give the wrong result for ints that don't fit in float. Here is a revised (albeit possibly slower) version that fixes that and supports values up to 52 bits (in particular, all 32-bit positive integer inputs):
((union { double val; uint64_t repr; }){ x }.repr >> 52) - 0x3ff
Also note that I'm assuming x is a positive (not just non-negative, also nonzero) number. If x is negative you'll get a bogus result, and if x is 0, you'll get a large negative result (approximating negative infinity as the logarithm).

Optimize me! (C, performance) -- followup to bit-twiddling question

Thanks to some very helpful stackOverflow users at Bit twiddling: which bit is set?, I have constructed my function (posted at the end of the question).
Any suggestions -- even small suggestions -- would be appreciated. Hopefully it will make my code better, but at the least it should teach me something. :)
Overview
This function will be called at least 10^13 times, and possibly as often as 10^15. That is, this code will run for months in all likelihood, so any performance tips would be helpful.
This function accounts for 72-77% of the program's time, based on profiling and about a dozen runs in different configurations (optimizing certain parameters not relevant here).
At the moment the function runs in an average of 50 clocks. I'm not sure how much this can be improved, but I'd be thrilled to see it run in 30.
Key Observation
If at some point in the calculation you can tell that the value that will be returned will be small (exact value negotiable -- say, below a million) you can abort early. I'm only interested in large values.
This is how I hope to save the most time, rather than by further micro-optimizations (though these are of course welcome as well!).
Performance Information
smallprimes is a bit array (64 bits); on average about 8 bits will be set, but it could be as few as 0 or as many as 12.
q will usually be nonzero. (Notice that the function exits early if q and smallprimes are zero.)
r and s will often be 0. If q is zero, r and s will be too; if r is zero, s will be too.
As the comment at the end says, nu is usually 1 by the end, so I have an efficient special case for it.
The calculations below the special case may appear to risk overflow, but through appropriate modeling I have proved that, for my input, this will not occur -- so don't worry about that case.
Functions not defined here (ugcd, minuu, star, etc.) have already been optimized; none take long to run. pr is a small array (all in L1). Also, all functions called here are pure functions.
But if you really care... ugcd is the gcd, minuu is the minimum, vals is the number of trailing binary 0s, __builtin_ffsll is the 1-based position of the least significant binary 1, star is (n-1) >> vals(n-1), pr is an array of the primes from 2 to 313.
The calculations are currently being done on a Phenom II 920 x4, though optimizations for i7 or Woodcrest are still of interest (if I get compute time on other nodes).
I would be happy to answer any questions you have about the function or its constituents.
What it actually does
Added in response to a request. You don't need to read this part.
The input is an odd number n with 1 < n < 4282250400097. The other inputs provide the factorization of the number in this particular sense:
smallprimes&1 is set if the number is divisible by 3, smallprimes&2 is set if the number is divisible by 5, smallprimes&4 is set if the number is divisible by 7, smallprimes&8 is set if the number is divisible by 11, etc. up to the most significant bit which represents 313. A number divisible by the square of a prime is not represented differently from a number divisible by just that number. (In fact, multiples of squares can be discarded; in the preprocessing stage in another function multiples of squares of primes <= lim have smallprimes and q set to 0 so they will be dropped, where the optimal value of lim is determined by experimentation.)
q, r, and s represent larger factors of the number. Any remaining factor (which may be greater than the square root of the number, or if s is nonzero may even be less) can be found by dividing factors out from n.
Once all the factors are recovered in this way, the number of bases, 1 <= b < n, to which n is a strong pseudoprime are counted using a mathematical formula best explained by the code.
Improvements so far
Pushed the early exit test up. This clearly saves work so I made the change.
The appropriate functions are already inline, so __attribute__ ((inline)) does nothing. Oddly, marking the main function bases and some of the helpers with __attribute__ ((hot)) hurt performance by almost 2% and I can't figure out why (but it's reproducible with over 20 tests). So I didn't make that change. Likewise, __attribute__ ((const)), at best, did not help. I was more than slightly surprised by this.
Code
ulong bases(ulong smallprimes, ulong n, ulong q, ulong r, ulong s)
{
if (!smallprimes & !q)
return 0;
ulong f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong nn = star(n);
ulong prod = 1;
while (smallprimes) {
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
nu = minuu(nu, vals(p - 1));
prod *= ugcd(nn, star(p));
n /= p;
while (n % p == 0)
n /= p;
smallprimes ^= bit;
}
if (q) {
nu = minuu(nu, vals(q - 1));
prod *= ugcd(nn, star(q));
n /= q;
while (n % q == 0)
n /= q;
} else {
goto BASES_END;
}
if (r) {
nu = minuu(nu, vals(r - 1));
prod *= ugcd(nn, star(r));
n /= r;
while (n % r == 0)
n /= r;
} else {
goto BASES_END;
}
if (s) {
nu = minuu(nu, vals(s - 1));
prod *= ugcd(nn, star(s));
n /= s;
while (n % s == 0)
n /= s;
}
BASES_END:
if (n > 1) {
nu = minuu(nu, vals(n - 1));
prod *= ugcd(nn, star(n));
f++;
}
// This happens ~88% of the time in my tests, so special-case it.
if (nu == 1)
return prod << 1;
ulong tmp = f * nu;
long fac = 1 << tmp;
fac = (fac - 1) / ((1 << f) - 1) + 1;
return fac * prod;
}

You seem to be wasting much time doing divisions by the factors. It is much faster to replace a division with a multiplication by the reciprocal of divisor (division: ~15-80(!) cycles, depending on the divisor, multiplication: ~4 cycles), IF of course you can precompute the reciprocals.
While this seems unlikely to be possible with q, r, s - due to the range of those vars, it is very easy to do with p, which always comes from the small, static pr[] array. Precompute the reciprocals of those primes and store them in another array. Then, instead of dividing by p, multiply by the reciprocal taken from the second array. (Or make a single array of structs.)
Now, obtaining exact division result by this method requires some trickery to compensate for rounding errors. You will find the gory details of this technique in this document, on page 138.
EDIT:
After consulting Hacker's Delight (an excellent book, BTW) on the subject, it seems that you can make it even faster by exploiting the fact that all divisions in your code are exact (i.e. remainder is zero).
It seems that for every divisor d which is odd and base B = 2^word_size, there exists a unique multiplicative inverse d⃰ which satisfies the conditions: d⃰ < B and d·d⃰ ≡ 1 (mod B). For every x which is an exact multiple of d, this implies x/d ≡ x·d⃰ (mod B). Which means you can simply replace a division with a multiplication, no added corrections, checks, rounding problems, whatever. (The proofs of these theorems can be found in the book.) Note that this multiplicative inverse need not be equal to the reciprocal as defined by the previous method!
How to check whether a given x is an exact multiple of d - i.e. x mod d = 0 ? Easy! x mod d = 0 iff x·d⃰ mod B ≤ ⌊(B-1)/d⌋. Note that this upper limit can be precomputed.
So, in code:
unsigned x, d;
unsigned inv_d = mulinv(d); //precompute this!
unsigned limit = (unsigned)-1 / d; //precompute this!
unsigned q = x*inv_d;
if(q <= limit)
{
//x % d == 0
//q == x/d
} else {
//x % d != 0
//q is garbage
}
Assuming the pr[] array becomes an array of struct prime:
struct prime {
ulong p;
ulong inv_p; //equal to mulinv(p)
ulong limit; //equal to (ulong)-1 / p
};
the while(smallprimes) loop in your code becomes:
while (smallprimes) {
ulong bit = smallprimes & (-smallprimes);
int bit_ix = __builtin_ffsll(bit);
ulong p = pr[bit_ix].p;
ulong inv_p = pr[bit_ix].inv_p;
ulong limit = pr[bit_ix].limit;
nu = minuu(nu, vals(p - 1));
prod *= ugcd(nn, star(p));
n *= inv_p;
for(;;) {
ulong q = n * inv_p;
if (q > limit)
break;
n = q;
}
smallprimes ^= bit;
}
And for the mulinv() function:
ulong mulinv(ulong d) //d needs to be odd
{
ulong x = d;
for(;;)
{
ulong tmp = d * x;
if(tmp == 1)
return x;
x *= 2 - tmp;
}
}
Note you can replace ulong with any other unsigned type - just use the same type consistently.
The proofs, whys and hows are all available in the book. A heartily recommended read :-).
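For reference, the pieces above can be put together into one small sketch. It fixes the word size at 64 bits by using uint64_t (an assumption; the answer itself uses ulong), and the names exact_div_t, exact_div_init and strip_factor are invented for this illustration. Only mulinv() and the limit test come from the answer.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of an odd d modulo 2^64 (Newton iteration). */
static uint64_t mulinv(uint64_t d)
{
    assert(d & 1);              /* only odd numbers are invertible mod 2^64 */
    uint64_t x = d;             /* correct to 3 bits: d*d == 1 (mod 8) for odd d */
    for (;;) {
        uint64_t tmp = d * x;
        if (tmp == 1)
            return x;
        x *= 2 - tmp;           /* doubles the number of correct low bits */
    }
}

/* Hypothetical helper type: values precomputed once per odd divisor. */
typedef struct {
    uint64_t inv;               /* mulinv(d) */
    uint64_t limit;             /* floor((2^64 - 1) / d) */
} exact_div_t;

static exact_div_t exact_div_init(uint64_t d)
{
    exact_div_t e;
    e.inv = mulinv(d);          /* asserts that d is odd (hence non-zero) */
    e.limit = UINT64_MAX / d;
    return e;
}

/* Remove every factor d from n using only multiplications and compares. */
static uint64_t strip_factor(uint64_t n, exact_div_t e)
{
    assert(n != 0);             /* 0 is divisible by d forever */
    for (;;) {
        uint64_t q = n * e.inv; /* equals n / d whenever d divides n */
        if (q > e.limit)        /* d does not divide n: q is garbage */
            return n;
        n = q;
    }
}
```

With this, strip_factor(n, exact_div_init(p)) plays the role of the inner for(;;) loop above; in real code the exact_div_t values would be stored in the pr[] table rather than recomputed per call.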

If your compiler supports GCC function attributes, you can mark your pure functions with this attribute:
ulong star(ulong n) __attribute__ ((const));
This attribute indicates to the compiler that the result of the function depends only on its argument(s). This information can be used by the optimiser.
Is there a reason why you've open-coded vals() instead of using __builtin_ctz() (or __builtin_ctzl() for an unsigned long argument)?
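To make the question concrete: the definition of vals() is not shown here, so the following is only a guess that it returns the number of trailing zero bits of its argument. Under that assumption, a loop-based version and a builtin-based version (both names are made up) would look roughly like this:

```c
#include <assert.h>

/* Open-coded guess at vals(): count trailing zero bits with a loop. */
static unsigned long vals_loop(unsigned long n)
{
    assert(n != 0);
    unsigned long count = 0;
    while ((n & 1) == 0) {
        n >>= 1;
        count++;
    }
    return count;
}

/* Same result via the GCC/Clang builtin, which is undefined for n == 0. */
static unsigned long vals_ctz(unsigned long n)
{
    assert(n != 0);
    return (unsigned long)__builtin_ctzl(n);
}
```

On targets with a bit-scan or count-trailing-zeros instruction, the builtin usually compiles to a single instruction.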

It is still somewhat unclear what you are searching for. Quite frequently, number-theoretic problems allow huge speedups by deriving mathematical properties that the solutions must satisfy.
If you are indeed searching for the integers that maximize the number of non-witnesses for the MR test (i.e. oeis.org/classic/A141768, which you mention), then it might be possible to use the fact that the number of non-witnesses cannot be larger than phi(n)/4, and that the integers which have this many non-witnesses are either the product of two primes of the form
(k+1)*(2k+1)
or they are Carmichael numbers with 3 prime factors.
I'd think above some limit all integers in the sequence have this form and that it is possible to verify this by proving an upper bound for the witnesses of all other integers.
E.g. integers with 4 or more factors always have at most phi(n)/8 non-witnesses. Similar results can be derived from your formula for the number of bases for other integers.
As for micro-optimizations: whenever you know that an integer is divisible by some odd divisor q, it is possible to replace the division by a multiplication with the inverse of q modulo 2^64. And the tests n % q == 0 can be replaced by a test
n * inverse_q <= max_q,
where inverse_q = q^(-1) mod 2^64 and max_q = (2^64 - 1) / q, rounded down.
Obviously inverse_q and max_q need to be precomputed, to be efficient, but since you are using a sieve, I assume this should not be an obstacle.

Small optimization but:
ulong f;
ulong nn;
ulong nu = 0xFFFF; // "Infinity" for the purpose of minimum
ulong prod = 1;
if (!smallprimes & !q)
return 0;
// no need to do these operations earlier, because of the early return above
f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
nn = star(n);
BTW: you should edit your post to add the definitions of star() and the other functions you use.

Try replacing this pattern (for r and q too):
n /= p;
while (n % p == 0)
n /= p;
With this:
ulong m;
...
m = n / p;
do {
n = m;
m = n / p;
} while ( m * p == n);
In my limited tests, I got a small speedup (10%) from eliminating the modulo.
Also, if p, q or r were constants, the compiler would replace the divisions by multiplications. If there are few choices for p, q or r, or if certain ones are more frequent, you might gain something by specializing the function for those values.
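The two patterns can be compared side by side in a minimal sketch. The function names are invented, uint64_t stands in for ulong, and both functions assume (as the original code does) that p divides n at least once:

```c
#include <assert.h>
#include <stdint.h>

/* Original pattern: a modulo and a division per loop iteration. */
static uint64_t strip_with_mod(uint64_t n, uint64_t p)
{
    assert(p > 1 && n != 0 && n % p == 0);
    n /= p;
    while (n % p == 0)
        n /= p;
    return n;
}

/* Suggested pattern: one division and one multiplication per iteration. */
static uint64_t strip_with_mul(uint64_t n, uint64_t p)
{
    assert(p > 1 && n != 0 && n % p == 0);
    uint64_t m = n / p;
    do {
        n = m;              /* accept the quotient computed last time */
        m = n / p;          /* tentative next quotient */
    } while (m * p == n);   /* exact only if p still divides n */
    return n;
}
```

Both return n with every factor p removed; whether the second form is actually faster depends on the compiler and CPU, so it is worth measuring.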

Have you tried using profile-guided optimisation?
Compile and link the program with the -fprofile-generate option, then run the program over a representative data set (say, a day's worth of computation).
Then re-compile and link it with the -fprofile-use option instead.

1) I would make the compiler spit out the assembly it generates and try to deduce whether what it does is the best it can do... and if you spot problems, change the code so the assembly looks better. This way you can also make sure that functions you hope it'll inline (like star and vals) are really inlined. (You might need to add pragmas, or even turn them into macros.)
2) It's great that you try this on a multicore machine, but this loop is single-threaded. I'm guessing that there is an umbrella function which splits the load across a few threads so that more cores are used?
3) It's difficult to suggest speedups if what the function actually tries to calculate is unclear. Typically the most impressive speedups are not achieved with bit twiddling, but with a change in the algorithm. So a few comments might help ;^)
4) If you really want a speedup of 10x or more, check out CUDA or OpenCL, which allow you to run C programs on your graphics hardware. It shines with functions like these!
5) You are doing loads of modulo and divides right after each other. In C this is 2 separate commands (first '/' and then '%'). However in assembly this is 1 command: 'DIV' or 'IDIV' which returns both the remainder and the quotient in one go:
B.4.75 IDIV: Signed Integer Divide
IDIV r/m8 ; F6 /7 [8086]
IDIV r/m16 ; o16 F7 /7 [8086]
IDIV r/m32 ; o32 F7 /7 [386]
IDIV performs signed integer division. The explicit operand provided is the divisor; the dividend and destination operands are implicit, in the following way:
For IDIV r/m8, AX is divided by the given operand; the quotient is stored in AL and the remainder in AH.
For IDIV r/m16, DX:AX is divided by the given operand; the quotient is stored in AX and the remainder in DX.
For IDIV r/m32, EDX:EAX is divided by the given operand; the quotient is stored in EAX and the remainder in EDX.
So it will require some inline assembly, but I'm guessing there'll be a significant speedup as there are a few places in your code which can benefit from this.
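Before reaching for inline assembly, it may be worth checking whether the compiler already does this. The sketch below assumes an optimizing build with GCC or Clang on x86, where / and % on the same operands are typically merged into a single divide instruction; that is worth verifying in the -S output rather than taking on trust. The type and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t quot;
    uint64_t rem;
} udivmod_t;

/* Quotient and remainder of the same operands, written next to each other. */
static udivmod_t udivmod(uint64_t n, uint64_t d)
{
    assert(d != 0);
    udivmod_t r;
    r.quot = n / d;
    r.rem  = n % d;
    return r;
}

/* Divide out all factors p, testing the remainder of the same division. */
static uint64_t strip_divmod(uint64_t n, uint64_t p)
{
    assert(p > 1 && n != 0);
    for (;;) {
        udivmod_t r = udivmod(n, p);
        if (r.rem != 0)
            return n;       /* p no longer divides n */
        n = r.quot;
    }
}
```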

Make sure your functions get inlined. If they're out-of-line, the overhead might add up, especially in the first while loop. The best way to be sure is to examine the assembly.
Have you tried pre-computing star( pr[__builtin_ffsll(bit)] ) and vals( pr[__builtin_ffsll(bit)] - 1) ? That would trade some simple work for an array lookup, but it might be worth it if the tables are small enough.
Don't compute f until you actually need it (near the end, after your early-out). You can replace the code around BASES_END with something like
BASES_END:; // empty statement, so that a declaration may follow the label
ulong addToF = 0;
if (n > 1) {
nu = minuu(nu, vals(n - 1));
prod *= ugcd(nn, star(n));
addToF = 1;
}
// ... early out if nu == 1...
// ... compute f ...
f += addToF;
Hope that helps.

First some nitpicking ;-) you should be more careful about the types that you are using. In some places you seem to assume that ulong is 64 bit wide, use uint64_t there. And also for all other types, rethink carefully what you expect of them and use the appropriate type.
The optimization that I could see is integer division. Your code does that a lot, and it is probably the most expensive thing you are doing. Division of small integers (uint32_t) may be much more efficient than division of big ones. In particular, for uint32_t there is an assembler instruction that does division and modulo in one go, called divl.
If you use the appropriate types your compiler might do that all for you. But you'd better check the assembler (option -S to gcc) as somebody already said. Otherwise it is easy to include some little assembler fragments here and there. I found something like that in some code of mine:
register uint32_t a asm("eax") = 0;
register uint32_t ret asm("edx") = 0;
asm("divl %4"
: "=a" (a), "=d" (ret)
: "0" (a), "1" (ret), "rm" (divisor));
As you can see this uses special registers eax and edx and stuff like that...

Did you try a table lookup version of the first while loop? You could split smallprimes into four 16-bit values, look up their contributions and merge them. But maybe you need the side effects.

Did you try passing in an array of primes instead of splitting them into smallprimes, q, r and s? Since I don't know what the outer code does, I am probably wrong, but there is a chance that you also have a function to convert some primes to a smallprimes bitmap, and inside this function, you effectively convert the bitmap back to an array of primes. In addition, you seem to do identical processing for elements of smallprimes, q, r, and s. It should save you a tiny amount of processing per call.
Also, you seem to know that the passed in primes divide n. Do you have enough knowledge outside about the power of each prime that divides n? You could save a lot of time if you can eliminate the modulo operation by passing in that information to this function. In other words, if n is pow(p_0,e_0)*pow(p_1,e_1)*...*pow(p_k,e_k)*n_leftover, and if you know more about these e_is and n_leftover, passing them in would mean a lot of things you don't have to do in this function.
There may be a way to discover n_leftover (the unfactored part of n) with fewer modulo operations, but it is only a hunch, so you may need to experiment with it a bit. The idea is to use gcd to remove known factors from n repeatedly until you get rid of all known prime factors. Let me give some almost-C code:
factors=p_0*p_1*...*p_k*q*r*s;
n_leftover=n/factors;
do {
factors=gcd(n_leftover, factors);
n_leftover = n_leftover/factors;
} while (factors != 1);
I am not at all certain this will be better than the code you have, let alone the combined mod/div suggestions you can find in other answers, but I think it is worth a try. I feel that it will be a win, especially for numbers with high numbers of small prime factors.
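A compilable rendering of the almost-C code above might look like the following sketch. The names gcd_u64 and leftover are made up, uint64_t is assumed for the integer type, and factors is assumed to be the product of the distinct known primes of n. Note that this plain Euclidean gcd still uses the modulo operator internally, so any gain would have to come from a cheaper gcd (e.g. a binary one) or from fewer iterations.

```c
#include <assert.h>
#include <stdint.h>

/* Plain Euclidean gcd. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b != 0) {
        uint64_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* factors: product of the distinct known primes of n (must divide n). */
static uint64_t leftover(uint64_t n, uint64_t factors)
{
    assert(factors != 0 && n % factors == 0);
    uint64_t n_leftover = n / factors;
    do {
        /* whatever is still shared with the known primes gets divided out */
        factors = gcd_u64(n_leftover, factors);
        n_leftover /= factors;
    } while (factors != 1);
    return n_leftover;
}
```

For example, with n = 2^3 * 3^2 * 5 * 7 * 101 = 254520 and factors = 2 * 3 * 5 * 7 = 210, the loop runs three times (gcd 6, then 2, then 1) and returns 101.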

You're passing in the complete factorization of n, so you're factoring consecutive integers and then using the results of that factorization here. It seems to me that you might benefit from doing some of this at the time of finding the factors.
BTW, I've got some really fast code for finding the factors you're using without doing any division. It's a little like a sieve but produces factors of consecutive numbers very quickly. Can find it and post if you think it may help.
Edit: I had to recreate the code here:
#include <stdio.h>
#define SIZE (1024*1024) //must be 2^n
#define MASK (SIZE-1)
typedef struct {
int p;
int next;
} p_type;
p_type primes[SIZE];
int sieve[SIZE];
void init_sieve()
{
int i,n;
int count = 1;
primes[1].p = 3;
sieve[1] = 1;
for (n=5;SIZE>n;n+=2)
{
int flag = 0;
for (i=1;count>=i;i++)
{
if ((n%primes[i].p) == 0)
{
flag = 1;
break;
}
}
if (flag==0)
{
count++;
primes[count].p = n;
sieve[n>>1] = count;
}
}
}
int main()
{
int ptr,n;
init_sieve();
printf("init_done\n");
// factor odd numbers starting with 3
for (n=1;1000000000>n;n++)
{
ptr = sieve[n&MASK];
if (ptr == 0) //prime
{
// printf("%d is prime",n*2+1);
}
else //composite
{
// printf ("%d has divisors:",n*2+1);
while(ptr!=0)
{
// printf ("%d ",primes[ptr].p);
sieve[n&MASK]=primes[ptr].next;
//move the prime to the next number it divides
primes[ptr].next = sieve[(n+primes[ptr].p)&MASK];
sieve[(n+primes[ptr].p)&MASK] = ptr;
ptr = sieve[n&MASK];
}
}
// printf("\n");
}
return 0;
}
The init function creates a factor base and initializes the sieve. This takes about 13 seconds on my laptop. Then all numbers up to 1 billion are factored or determined to be prime in another 25 seconds. Numbers less than SIZE are never reported as prime because they have 1 factor in the factor base, but that could be changed.
The idea is to maintain a linked list for every entry in the sieve. Numbers are factored by simply pulling their factors out of the linked list. As they are pulled out, they are inserted into the list for the next number that will be divisible by that prime. This is very cache friendly too. The sieve size must be larger than the largest prime in the factor base. As is, this sieve could run up to 2**40 in about 7 hours which seems to be your target (except for n needing to be 64 bits).
Your algorithm could be merged into this to make use of the factors as they are identified rather than packing bits and large primes into variables to pass to your function. Or your function could be changed to take the linked list (you could create a dummy link to pass in for the prime numbers outside the factor base).
Hope it helps.
BTW, this is the first time I've posted this algorithm publicly.

Just a thought, but maybe using your compiler's optimization options would help, if you haven't already. Another thought: if money isn't an issue, you could use the Intel C/C++ compiler, assuming you're using an Intel processor. I'd also assume that other processor manufacturers (AMD, etc.) have similar compilers.

If you are going to exit immediately on (!smallprimes&!q) why not do that test before even calling the function, and save the function call overhead?
Also, it seems like you effectively have 3 different functions which are linear except for the smallprimes loop.
bases1(smallprimes,n,q), bases2(smallprimes,n,q,r), and bases3(smallprimes,n,q,r,s).
It might be a win to actually create those as 3 separate functions without the branches and gotos, and call the appropriate one:
if (!(smallprimes|q)) { result = 0; }
else if (s) { result = bases3(smallprimes,n,q,r,s); }
else if (r) { result = bases2(smallprimes,n,q,r); }
else { result = bases1(smallprimes,n,q); }
This would be most effective if previous processing has already given the calling code some 'knowledge' of which function to execute and you don't have to test for it.

If the divisors aren't known at compile time but are reused frequently at runtime (dividing by the same number many times), then I would suggest using the libdivide library, which basically implements at runtime the optimisations that compilers do for compile-time constants (using shifts, masks, etc.). This can provide a huge benefit. Also, replacing x % y == 0 with something like z = x / y; z * y == x, as ergosys suggested above, should give a measurable improvement.

Is the code in your original post the optimized version? If yes, there are still too many divide operations, which eat a lot of CPU cycles.
This code does a bit more work than necessary:
if (!smallprimes & !q)
return 0;
Changing it to the logical AND, &&,
if (!smallprimes && !q)
return 0;
makes it short-circuit, so q is not evaluated when smallprimes is non-zero.
And the following code
ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];
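which is used to find the lowest set bit of smallprimes. Why don't you use the simpler way
ulong p = pr[__builtin_ctzll(smallprimes) + 1];
(__builtin_ffsll() is 1-based while __builtin_ctzll() is 0-based, hence the + 1. The variable bit is still needed for smallprimes ^= bit, unless that line becomes smallprimes &= smallprimes - 1.)
Another culprit for decreased performance may be too much branching. You may consider changing to equivalents with fewer or no branches.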
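As a small illustration of walking the set bits of a mask from lowest to highest without isolating each bit first, here is a sketch with a made-up function name. It records 0-based positions, so code that previously indexed a table with __builtin_ffsll() (which is 1-based) would need to add 1 to each position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stores the 0-based positions of the set bits of mask in out[],
   lowest first, and returns how many were stored. */
static size_t set_bit_positions(uint64_t mask, int *out, size_t cap)
{
    size_t count = 0;
    while (mask) {
        assert(count < cap);
        out[count++] = __builtin_ctzll(mask);  /* index of lowest set bit */
        mask &= mask - 1;                      /* clear that bit */
    }
    return count;
}
```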
