Find the INDEX of element having max. absolute value using AVX512 instructions

Find the INDEX of element having max. absolute value using AVX512 instructions - c

I'm a newbie for coding using AVX512 instructions. My machine is Intel KNL 7250. I am trying to use AVX512 instructions to find the INDEX of the element having maximum absolute value, which double precision and size of array % 8 = 0. But it prints an output index = 0 every time. I don't know where is a problem, please help me.
Also, how to use printf for __m512i type?
Thanks.
Code:
void main()
{
int i;
int N=160;
double vec[N];
for(i=0;i<N;i++)
{
vec[i]=(double)(-i) ;
if(i==10)
{
vec[i] = -1127;
}
}
int max = avxmax_(N, vec);
printf("maxindex=%d\n", max);
}
int avxmax_(int N, double *X )
{
// return the index of element having maximum absolute value.
int maxindex, ix, i, M;
register __m512i increment, indices, maxindices, maxvalues, absmax, max_values_tmp, abs_max_tmp, tmp;
register __mmask8 gt;
double values_X[8];
double indices_X[8];
double maxvalue;
maxindex = 1;
if( N == 1) return(maxindex);
M = N % 8;
if( M == 0)
{
increment = _mm512_set1_epi64(8); // [8,8,8,8,8,8,8,8]
indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
maxindices = indices;
maxvalues = _mm512_loadu_si512(&X[0]);
absmax = _mm512_abs_epi64(maxvalues);
for( i = 8; i < N; i += 8)
{
// advance scalar indices: indices[0] + 8, indices[1] + 8,...,indices[7] + 8
indices = _mm512_add_epi64(indices, increment);
// compare
max_values_tmp = _mm512_loadu_si512(&X[i]);
abs_max_tmp = _mm512_abs_epi64(max_values_tmp);
gt = _mm512_cmpgt_epi64_mask(abs_max_tmp, absmax);
// update
maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
absmax = _mm512_max_epi64(absmax, abs_max_tmp);
}
// scalar part
_mm512_storeu_si512((__m512i*)values_X, absmax);
_mm512_storeu_si512((__m512i*)indices_X, maxindices);
maxindex = indices_X[0];
maxvalue = values_X[0];
for(i = 1; i < 8; i++)
{
if(values_X[i] > maxvalue)
{
maxvalue = values_X[i];
maxindex = indices_X[i];
}
}
return(maxindex);
}
}

Your function returns 0 because you're treating the int64 index as the bit-pattern for a double, and converting that (tiny) number to an integer. double indices_X[8]; is the bug; should be uint64_t. There are other bugs, see below.
This bug is easier to spot if you declare variables as you use them, C99 style, not obsolete C89 style.
You _mm512_storeu_si512 the vector of int64_t indices into double indices_X[8], type-punning it to double, then in plain C do int maxindex = indices_X[0];. This is implicit type-conversion, converting that subnormal double to an integer.
(I noticed a mysterious vcvttsd2si FP->int conversion in the asm https://godbolt.org/z/zsfc36 while converting the code to C99 style variable declarations next to initializers. That was a clue: there should be no FP->int conversion in this function. I noticed that around the same time I was moving the double indices_X[8]; declaration down into the block that uses it, and noticing it had type double.)
It is actually possible to use integer operations on FP bit-patterns
But only if you use the right ones! IEEE754 exponent biases mean that the encoding / bit-pattern can be compared as a sign/magnitude integer. So you can do abs / min / max and compare on it, but not of course integer add / sub (unless you're implementing nextafter).
_mm512_abs_epi64 is 2's complement abs, not sign-magnitude. Instead, you must just mask off the sign bit. Then you're all set to treat the result as an unsigned integer or signed-2's-complement. (Either works because the high bit is clear.)
Using integer max has the interesting property that NaNs will compare higher than any other value, Inf below that, then finite values. So we get a NaN-propagating max-finder basically for free.
On KNL (Knight's Landing), FP vmaxpd and vcmppd have the same performance as their integer equivalents: 2 cycle latency, 0.5c throughput. (https://agner.org/optimize/). So your way has zero advantage on KNL, but it's a neat trick for mainstream Intel, like Skylake-X and IceLake.
Bugfixed optimized version:
Use size_t for return type and loop counters / indices to handle potentially huge arrays, instead of a random mix of int and 64-bit vector elements. (uint64_t for the temp array that collects the horizontal-max: it's always 64-bit even in a build with 32-bit pointers / size_t.)
bugfix: return 0 on N==1, not 1: the index of the only element is 0.
bugfix: return -1 on N%8 != 0, instead of falling off the end of the non-void function. (Undefined behaviour if the caller uses the result in C, or as soon as execution falls off the end in C++).
bugfix: abs of an FP value = clear the sign bit, not 2's complement abs on the bit-pattern
sort of bugfix: use unsigned integer compare and max, so it would work for 2's complement integers with _mm512_abs_epi64 (which produces an unsigned result; remember that -LONG_MIN overflows to LONG_MIN if you keep treating it as signed).
style improvement: if (N%8 != 0) return -1; instead of putting most of the body in an if block.
style improvement: declare vars when they're first used, and removed some unused ones that were pure noise. This is idiomatic for C since C99, which was standardized over 20 years ago.
style improvement: use simpler names for tmp vector vars that just hold a load result. Sometimes you just need a tmp var because intrinsic names are so long that you don't want to type _mm...load... as an arg for another intrinsics. A name like v scoped to the inner loop is a clear sign it's just a placeholder, not used later. (This style works best when you're declaring it as you init it, so it's easy to see it can't be used in an outer scope.)
optimization: reduce 8 -> 4 elements after the loop with SIMD: extract the high half, combine with existing low half. (Same as you would for a simpler horizontal reduction like sum or max). Inconvenient when we need instructions that only AVX512 has, but KNL doesn't have AVX512VL, so we have to use the 512-bit version and ignore the high garbage. But KNL does have AVX1 / AVX2 so we can still store 256-bit vectors and do some things.
Using a merge-masking _mm512_mask_extracti64x4_epi64 extract to blend the high half directly the low half of the same vector is a cool trick which compilers don't find if you use a 512-bit mask-blend. :P
sort of bugfix: in C, main has a return type of int in hosted implementations (running under an OS).
#include <immintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
// bugfix: indices can be larger than an int
size_t avxmax_(size_t N, double *X )
{
// return the index of element having maximum absolute value.
if( N == 1)
return 0; // bugfix: 0 is the only valid element in this case, not 1
if( N % 8 != 0) // [[unlikely]] // C++20
return -1; // bugfix: don't fall off the end of the function in this case
const __m512i fp_absmask = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFF);
__m512i indices = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 6, 7);
__m512i maxindices = indices;
__m512i v = _mm512_loadu_si512(&X[0]);
__m512i absmax = _mm512_and_si512(v, fp_absmask);
for(size_t i = 8; i < N; i += 8) // [[likely]] // C++20
{
// advance indices by 8 each.
indices = _mm512_add_epi64(indices, _mm512_set1_epi64(8));
// compare
v = _mm512_loadu_si512(&X[i]);
__m512i vabs = _mm512_and_si512(v, fp_absmask);
// vabs = _mm512_abs_epi64(max_values_tmp); // for actual integers, not FP bit patterns
__mmask8 gt = _mm512_cmpgt_epu64_mask(vabs, absmax);
// update
maxindices = _mm512_mask_blend_epi64(gt, maxindices, indices);
absmax = _mm512_max_epu64(absmax, vabs);
}
// reduce 512->256; KNL doesn't have AVX512VL so some ops require 512-bit vectors
__m256i absmax_hi = _mm512_extracti64x4_epi64(absmax, 1);
__m512i absmax_hi512 = _mm512_castsi256_si512(absmax_hi); // free
__mmask8 gt = _mm512_cmpgt_epu64_mask(absmax_hi512, absmax);
__m256i abs256 = _mm512_castsi512_si256(_mm512_max_epu64(absmax_hi512, absmax)); // reduced to low 4 elements
// extract with merge-masking = blend
__m256i maxindices256 = _mm512_mask_extracti64x4_epi64(
_mm512_castsi512_si256(maxindices), gt, maxindices, 1);
// scalar part
double values_X[4];
uint64_t indices_X[4];
_mm256_storeu_si256((__m256i*)values_X, abs256);
_mm256_storeu_si256((__m256i*)indices_X, maxindices256);
size_t maxindex = indices_X[0];
double maxvalue = values_X[0];
for(int i = 1; i < 4; i++)
{
if(values_X[i] > maxvalue)
{
maxvalue = values_X[i];
maxindex = indices_X[i];
}
}
return maxindex;
}
On Godbolt: the main loop from GCC10.2 -O3 -march=knl is 8 instructions. So even if (best case) KNL could decode and run it at 2/clock, it's still taking 4 cycles per vector. You can run the program on Godbolt; it runs on Skylake-X servers so it can run AVX512 code. You can see it prints 10.
.L4:
vpandd zmm2, zmm5, ZMMWORD PTR [rsi+rax*8] # load, folded into AND
add rax, 8
vpcmpuq k1, zmm2, zmm0, 6
vpaddq zmm1, zmm1, zmm4 # increment current indices
cmp rdi, rax
vmovdqa64 zmm3{k1}, zmm1 # blend maxidx using merge-masking
vpmaxuq zmm0, zmm0, zmm2
ja .L4
vmovapd zmm1, zmm3 # silly missed optimization related to the case where the loop runs 0 times.
.L3:
vextracti64x4 ymm2, zmm0, 0x1 # high half of absmax
vpcmpuq k1, zmm2, zmm0, 6 # compare high and low
vpmaxuq zmm0, zmm0, zmm2
# vunpckhpd xmm2, xmm0, xmm0 # setting up for unrolled scalar loop
vextracti64x4 ymm1{k1}, zmm3, 0x1 # masked extract of indices
Another option for the loop would be a masked vpbroadcastq zmm3{k1}, rax, adding the [0..7] per-element offsets after the loop. That would actually save the vpaddq in the loop, and we have the right i in a register if GCC is going to use an indexed addressing-mode anyway. (That's not good on Skylake-X; defeats micro-fusion of the memory-source vpandd.)
Agner Fog doesn't list performance for GP->vector broadcasts, but hopefully it's only single-uop on KNL at least. (And https://uops.info/ doesn't have KNL or KNM results).
Branchy strategy: when a new max is very rare
If you expect finding a new max to be very rare (e.g. array is large and uniformly distributed, or at least not trending upwards), it could be even faster to broadcast the current max and branch on finding any greater vector element.
Finding a new max means branching out of the loop (which probably mispredicts, so that's slow) and broadcasting that element (probably with a tzcnt to find the element index, then a broadcast-load, and update the index).
Especially with KNL's 4-way SMT to hide branch miss costs, this could be an overall throughput win for large arrays; fewer instructions per element on average.
But probably significantly worse for inputs that do trend upwards, so a new max is found O(n) times on average, not sqrt(n) or log(n) or whatever frequency a uniform distribution would give us.
PS: to print vectors, store to an array and reload the elements. print a __m128i variable
Or use a debugger to show you their elements.

Related

Fastest Integer Square Root in the least amount of instructions

I am in a need of fast integer square root that does not involve any explicit division. The target RISC architecture can do operations like add, mul, sub, shift in one cycle (well - the operation's result is written in third cycle, really - but there's interleaving), so any Integer algorithm that uses these ops and is fast would be very appreciated.
This is what I have right now and I'm thinking that a binary search should be faster, since the following loop executes 16 times every single time (regardless of the value). I haven't debugged it extensively yet (but soon), so perhaps it's possible to have an early exit there:
unsigned short int int_sqrt32(unsigned int x)
{
unsigned short int res=0;
unsigned short int add= 0x8000;
int i;
for(i=0;i<16;i++)
{
unsigned short int temp=res | add;
unsigned int g2=temp*temp;
if (x>=g2)
{
res=temp;
}
add>>=1;
}
return res;
}
Looks like the current performance cost of the above [in the context of the target RISC] is a loop of 5 instructions (bitset, mul, compare, store, shift). Probably no space in cache to unroll fully (but this will be the prime candidate for a partial unroll [e.g. A loop of 4 rather than 16], for sure). So, the cost is 16*5 = 80 instructions (plus loop overhead, if not unrolled). Which, if fully interleaved, would cost only 80 (+2 for last instruction) cycles.
Can I get some other sqrt implementation (using only add, mul, bitshift, store/cmp) under 82 cycles?
FAQ:
Why don't you rely on the compiler to produce a good fast code?
There is no working C → RISC compiler for the platform. I will be porting the current reference C code into hand-written RISC ASM.
Did you profile the code to see if sqrt is actually a bottleneck?
No, there is no need for that. The target RISC chip is about twenty MHz, so every single instruction counts. The core loop (calculating the energy transfer form factor between the shooter and receiver patch), where this sqrt is used, will be run ~1,000 times each rendering frame (assuming it will be fast enough, of course), up to 60,000 per second, and roughly 1,000,000 times for whole demo.
Have you tried to optimize the algorithm to perhaps remove the sqrt?
Yes, I did that already. In fact, I got rid of 2 sqrts already and lots of divisions (removed or replaced by shifting). I can see a huge performance boost (compared to the reference float version) even on my gigahertz notebook.
What is the application?
It's a real-time progressive-refinement radiosity renderer for the compo demo. The idea is to have one shooting cycle each frame, so it would visibly converge and look better with each rendered frame (e.g. Up 60-times per second, though the SW rasterizer won't probably be that fast [but at least it can run on the other chip in parallel with the RISC - so if it takes 2-3 frames to render the scene, the RISC will have worked through 2-3 frames of radiosity data, in parallel]).
Why don't you work directly in target ASM?
Because radiosity is a slightly involved algorithm and I need the instant edit & continue debugging capability of Visual Studio. What I've done over the weekend in VS (couple hundred code changes to convert the floating-point math to integer-only) would take me 6 months on the target platform with only printing debugging".
Why can't you use a division?
Because it's 16-times slower on the target RISC than any of the following: mul, add, sub, shift, compare, load/store (which take just 1 cycle). So, it's used only when absolutely required (a couple times already, unfortunately, when shifting could not be used).
Can you use look-up tables?
The engine needs other LUTs already and copying from main RAM to RISC's little cache is prohibitively expensive (and definitely not each and every frame). But, I could perhaps spare 128-256 Bytes if it gave me at least a 100-200% boost for sqrt.
What's the range of the values for sqrt?
I managed to reduce it to mere unsigned 32-bit int (4,294,967,295)
EDIT1: I have ported two versions into the target RISC ASM, so I now have an exact count of ASM instructions during the execution (for the test scene).
Number of sqrt calls: 2,800.
Method1: The same method in this post (loop executing 16 times)
Method2: fred_sqrt (3c from http://www.azillionmonkeys.com/qed/sqroot.html)
Method1: 152.98 instructions per sqrt
Method2: 39.48 instructions per sqrt (with Final Rounding and 2 Newton iterations)
Method2: 21.01 instructions per sqrt (without Final Rounding and 2 Newton iterations)
Method2 uses LUT with 256 values, but since the target RISC can only use 32-bit access within its 4 KB cache, it actually takes 256*4 = 1 KB. But given its performance, I guess I will have to spare that 1 KB (out of 4).
Also, I have found out that there is NO visible visual artifact when I disable the Final rounding and two Newton iterations at the end (of Method2).
Meaning, the precision of that LUT is apparently good enough. Who knew...
The final cost is then 21.01 instructions per sqrt, which is almost ~order of magnitude faster than the very first solution. There's also possibility of reducing it further by sacrificing few of the 32 available registers for the constants used for the conditions and jump labels (each condition must fill 2 registers - one with the actual constant (only values less than 16 are allowed within CMPQ instruction, larger ones must be put into register) we are comparing against and second for the jump to the else label (the then jump is fall-through), as the direct relative jump is only possible within ~10 instructions (impossible with such large if-then-else chain, other than innermost 2 conditions).
EDIT2: ASM micro-optimizations
While benchmarking, I added counters for each of the 26 If.Then.Else codeblocks, to see if there aren't any blocks executed most often.
Turns out, that Blocks 0/10/11 are executed in 99.57%/99.57%/92.57% of cases. This means I can justify sacrificing 3 registers (out of 32) for those comparison constants (in those 3 blocks), e.g. r26 = $1.0000 r25 = $100.0000 r24 = $10.0000
This brought down the total instruction cost from 58,812 (avg:21.01) to 50,448 (avg:18.01)
So, now the average sqrt cost is just 18.01 ASM instructions (and there is no division!), though it will have to be inlined.
EDIT3: ASM micro-optimizations
Since we know that those 3 blocks (0/10/11) are executed most often, we can use local short jumps (16 Bytes in both directions, which is usually just couple of instructions (hence mostly useless), especially when the 6-byte MOVEI #jump_label, register is used during conditions) in those particular conditions. Of course, the Else condition will then incur additional 2 ops (that it would not have otherwise), but that's worth it. The block 10 will have to be swapped (Then block with Else block), which will make it harder to read and debug, but I documented the reasons profusely.
Now the total instruction cost (in test scene) is just 42,500 with an average of 15.18 ASM instructions per sqrt.
EDIT4: ASM micro-optimizations
Block 11 condition splits into innermost Blocks 12&13. It just so happens that those blocks don't need additional +1 math op, hence the local short jump can actually reach the Else block, if I merge bitshift right with the necessary bitshift left #2 (as all offsets within cache must be 32-bit). This saves on filling the jump register though I do need to sacrifice one more register r23 for the comparison value of $40.000.
The final cost is then 34,724 instructions with an average of 12.40 ASM instructions per sqrt.
I am also realizing that I could reshuffle the order of conditions (which will make the other range few ops more expensive, but that's happening for ~7% of cases only), favoring this particular range ($10.000, $40.000) first, saving on at least 1 or maybe even 2 conditions.
In which case, it should fall down to ~8.40 per sqrt.
I am realizing that the range depends directly on intensity of the light and the distance to the wall. Meaning, I have direct control over the RGB value of the light and distance from the wall. And while I would like the solution to be as generic as possible, given this realization (~12 ops per sqrt is mind-blowing), I will gladly sacrifice some flexibility in light colors if I can get sqrt this fast. Besides, there's maybe 10-15 different lights in whole demo, so I can simply find color combinations that eventually result in same sqrt range, but will get insanely fast sqrt. For sure, that's worth it to. And I still have a generic fallback (spanning entire int range) working just fine. Best of both worlds, really.

Have a look here.
For instance, at 3(a) there is this method, which is trivially adaptable to do a 64->32 bit square root, and also trivially transcribable to assembler:
/* by Jim Ulery */
static unsigned julery_isqrt(unsigned long val) {
unsigned long temp, g=0, b = 0x8000, bshft = 15;
do {
if (val >= (temp = (((g << 1) + b)<<bshft--))) {
g += b;
val -= temp;
}
} while (b >>= 1);
return g;
}
No divisions, no multiplications, bit shifts only. However, the time taken will be somewhat unpredictable particularly if you use a branch (on ARM RISC conditional instructions would work).
In general, this page lists ways to calculate square roots. If you happen to want to produce a fast inverse square root (i.e. x**(-0.5) ), or are just interested in amazing ways to optimise code, take a look at this, this and this.

This is the same as yours, but with fewer ops. (I count 9 ops in the loop in your code, including test and increment i in the for loop and 3 assignments, but perhaps some of those disappear when coded in ASM? There are 6 ops in the code below, if you count g*g>n as two (no assignment)).
int isqrt(int n) {
int g = 0x8000;
int c = 0x8000;
for (;;) {
if (g*g > n) {
g ^= c;
}
c >>= 1;
if (c == 0) {
return g;
}
g |= c;
}
}
I got it here. You can maybe eliminate a comparison if you unroll the loop and jump to the appropriate spot based on the highest non-zero bit in the input.
Update
I've been thinking more about using Newton's method. In theory, the number of bits of accuracy should double for each iteration. That means Newton's method is much worse than any of the other suggestions when there are few correct bits in the answer; however, the situation changes where there are a lot of correct bits in the answer. Considering that most suggestions seem to take 4 cycles per bit, that means that one iteration of Newton's method (16 cycles for division + 1 for addition + 1 for shift = 18 cycles) is not worthwhile unless it gives over 4 bits.
So, my suggestion is to build up 8 bits of the answer by one of the suggested methods (8*4 = 32 cycles) then perform one iteration of Newton's method (18 cycles) to double the number of bits to 16. That's a total of 50 cycles (plus maybe an extra 4 cycles to get 9 bits before applying Newton's method, plus maybe 2 cycles to overcome the overshoot occasionally experienced by Newton's method). That's a maximum of 56 cycles which as far as I can see rivals any of the other suggestions.
Second Update
I coded the hybrid algorithm idea. Newton's method itself has no overhead; you just apply and double the number of significant digits. The issue is to have a predictable number of significant digits before you apply Newton's method. For that, we need to figure out where the most significant bit of the answer will appear. Using a modification of the fast DeBruijn sequence method given by another poster, we can perform that calculation in about 12 cycles in my estimation. On the other hand, knowing the position of the msb of the answer speeds up all methods (average, not worst case), so it seems worthwhile no matter what.
After calculating the msb of the answer, I run a number of rounds of the algorithm suggested above, then finish it off with one or two rounds of Newton's method. We decide when to run Newton's method by the following calculation: one bit of the answer takes about 8 cycles according to calculation in the comments; one round of Newton's method takes about 18 cycles (division, addition, and shift, and maybe assignment), so we should only run Newton's method if we're going to get at least three bits out of it. So for 6 bit answers, we can run the linear method 3 times to get 3 significant bits, then run Newton's method 1 time to get another 3. For 15 bit answers, we run the linear method 4 times to get 4 bits, then Newton's method twice to get another 4 then another 7. And so on.
Those calculations depend on knowing exactly how many cycles are required to get a bit by the linear method vs. how many are required by Newton's method. If the "economics" change, e.g., by discovering a faster way to build up bits in a linear fashion, the decision of when to invoke Newton's method will change.
I unrolled the loops and implemented the decisions as switches, which I hope will translate into fast table lookups in assembly. I'm not absolutely sure that I've got the minimum number of cycles in each case, so maybe further tuning is possible. E.g., for s=10, you can try to get 5 bits then apply Newton's method once instead of twice.
I've tested the algorithm thoroughly for correctness. Some additional minor speedups are possible if you're willing to accept slightly incorrect answers in some cases. At least two cycles are used after applying Newton's method to correct an off-by-one error that occurs with numbers of the form m^2-1. And a cycle is used testing for input 0 at the beginning, as the algorithm can't handle that input. If you know you're never going to take the square root of zero you can eliminate that test. Finally, if you only need 8 significant bits in the answer, you can drop one of the Newton's method calculations.
#include <inttypes.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
uint32_t isqrt1(uint32_t n);
int main() {
uint32_t n;
bool it_works = true;
for (n = 0; n < UINT32_MAX; ++n) {
uint32_t sr = isqrt1(n);
if ( sr*sr > n || ( sr < 65535 && (sr+1)*(sr+1) <= n )) {
it_works = false;
printf("isqrt(%" PRIu32 ") = %" PRIu32 "\n", n, sr);
}
}
if (it_works) {
printf("it works\n");
}
return 0;
}
/* table modified to return shift s to move 1 to msb of square root of x */
/*
static const uint8_t debruijn32[32] = {
0, 31, 9, 30, 3, 8, 13, 29, 2, 5, 7, 21, 12, 24, 28, 19,
1, 10, 4, 14, 6, 22, 25, 20, 11, 15, 23, 26, 16, 27, 17, 18
};
*/
static const uint8_t debruijn32[32] = {
15, 0, 11, 0, 14, 11, 9, 1, 14, 13, 12, 5, 9, 3, 1, 6,
15, 10, 13, 8, 12, 4, 3, 5, 10, 8, 4, 2, 7, 2, 7, 6
};
/* based on CLZ emulation for non-zero arguments, from
* http://stackoverflow.com/questions/23856596/counting-leading-zeros-in-a-32-bit-unsigned-integer-with-best-algorithm-in-c-pro
*/
uint8_t shift_for_msb_of_sqrt(uint32_t x) {
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
x++;
return debruijn32 [x * 0x076be629 >> 27];
}
uint32_t isqrt1(uint32_t n) {
if (n==0) return 0;
uint32_t s = shift_for_msb_of_sqrt(n);
uint32_t c = 1 << s;
uint32_t g = c;
switch (s) {
case 9:
case 5:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 15:
case 14:
case 13:
case 8:
case 7:
case 4:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 12:
case 11:
case 10:
case 6:
case 3:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 2:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 1:
if (g*g > n) {
g ^= c;
}
c >>= 1;
g |= c;
case 0:
if (g*g > n) {
g ^= c;
}
}
/* now apply one or two rounds of Newton's method */
switch (s) {
case 15:
case 14:
case 13:
case 12:
case 11:
case 10:
g = (g + n/g) >> 1;
case 9:
case 8:
case 7:
case 6:
g = (g + n/g) >> 1;
}
/* correct potential error at m^2-1 for Newton's method */
return (g==65536 || g*g>n) ? g-1 : g;
}
In light testing on my machine (which admittedly is nothing like yours), the new isqrt1 routine runs about 40% faster on average than the previous isqrt routine I gave.

If multiplication is the same speed (or faster than!) addition and shifting, or if you lack a fast shift-by-amount-contained-in-a-register instruction, then the following will not be helpful. Otherwise:
You're computing temp*temp afresh on each loop cycle, but temp = res | add, which is the same as res + add since their bits don't overlap, and (a) you have already computed res*res on a previous loop cycle, and (b) add has special structure (it's always just a single bit). So, using the fact that (a+b)^2 = a^2 + 2ab + b^2, and you already have a^2, and b^2 is just a single bit shifted twice as far to the left as the single-bit b, and 2ab is just a left-shifted by 1 more position than the location of the single bit in b, you can get rid of the multiplication:
unsigned short int int_sqrt32(unsigned int x)
{
unsigned short int res = 0;
unsigned int res2 = 0;
unsigned short int add = 0x8000;
unsigned int add2 = 0x80000000;
int i;
for(i = 0; i < 16; i++)
{
unsigned int g2 = res2 + (res << i) + add2;
if (x >= g2)
{
res |= add;
res2 = g2;
}
add >>= 1;
add2 >>= 2;
}
return res;
}
Also I would guess that it's a better idea to use the same type (unsigned int) for all variables, since according to the C standard, all arithmetic requires promotion (conversion) of narrower integer types to the widest type involved before the arithmetic operation is performed, followed by subsequent back-conversion if necessary. (This may of course be optimised away by a sufficiently intelligent compiler, but why take the risk?)

From the comment trail, it seems that the RISC processor only provides 32x32->32 bit multiplication and 16x16->32 bit multiplication. A 32x-32->64 bit widening multiply, or a MULHI instruction returning the upper 32 bits of a 64-bit product is not provided.
This would seem to exclude approaches based on Newton-Raphson iteration, which would likely be inefficient, as they typically require either MULHI instruction or widening multiply for the intermediate fixed-point arithmetic.
The C99 code below uses a different iterative approach that requires only 16x16->32 bit multiplies, but converges somewhat linearly, requiring up to six iterations. This approach requires CLZ functionality to quickly determine a starting guess for the iterations. Asker stated in the comments that the RISC processor used does not provide CLZ functionality. So emulation of CLZ is required, and since the emulation adds to both storage and instruction count, this may make this approach uncompetitive. I performed a brute-force search to determine the deBruijn lookup table with the smallest multiplier.
This iterative algorithm delivers raw results quite close to the desired results, i.e. (int)sqrt(x), but always somewhat on the high side due to the truncating nature of integer arithmetic. To arrive at the final result, the result is conditionally decremented until the square of the result is less than or equal to the original argument.
The use of the volatile qualifier in the code only serves to establish that all named variables can in fact be allocated as 16-bit data without impacting the functionality. I do not know whether this provides any advantage, but noticed that the OP specifically used 16-bit variables in their code. For production code, volatile should be removed.
Note that for most processors, the correction steps at the end should not involve any branching. The product y*y can be subtracted from x with carry-out (or borrow-out), then y is corrected by a subtract with carry-in (or borrow-in). So each step should be a sequence MUL, SUBcc, SUBC.
Because implementation of the iteration by a loop incurs substantial overhead, I have elected to completely unroll the loop, but provide two early-out checks. Tallying the operations manually I count 46 operations for the fastest case, 54 operations for the average case, and 60 operations for the worst case.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
static const uint8_t clz_tab[32] = {
31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0};
uint8_t clz (uint32_t a)
{
a |= a >> 16;
a |= a >> 8;
a |= a >> 4;
a |= a >> 2;
a |= a >> 1;
return clz_tab [0x07c4acdd * a >> 27];
}
/* 16 x 16 -> 32 bit unsigned multiplication; should be single instruction */
uint32_t umul16w (uint16_t a, uint16_t b)
{
return (uint32_t)a * b;
}
/* Reza Hashemian, "Square Rooting Algorithms for Integer and Floating-Point
Numbers", IEEE Transactions on Computers, Vol. 39, No. 8, Aug. 1990, p. 1025
*/
uint16_t isqrt (uint32_t x)
{
volatile uint16_t y, z, lsb, mpo, mmo, lz, t;
if (x == 0) return x; // early out, code below can't handle zero
lz = clz (x); // #leading zeros, 32-lz = #bits of argument
lsb = lz & 1;
mpo = 17 - (lz >> 1); // m+1, result has roughly half the #bits of argument
mmo = mpo - 2; // m-1
t = 1 << mmo; // power of two for two's complement of initial guess
y = t - (x >> (mpo - lsb)); // initial guess for sqrt
t = t + t; // power of two for two's complement of result
z = y;
y = (umul16w (y, y) >> mpo) + z;
y = (umul16w (y, y) >> mpo) + z;
if (x >= 0x40400) {
y = (umul16w (y, y) >> mpo) + z;
y = (umul16w (y, y) >> mpo) + z;
if (x >= 0x1002000) {
y = (umul16w (y, y) >> mpo) + z;
y = (umul16w (y, y) >> mpo) + z;
}
}
y = t - y; // raw result is 2's complement of iterated solution
y = y - umul16w (lsb, (umul16w (y, 19195) >> 16)); // mult. by sqrt(0.5)
if ((int32_t)(x - umul16w (y, y)) < 0) y--; // iteration may overestimate
if ((int32_t)(x - umul16w (y, y)) < 0) y--; // result, adjust downward if
if ((int32_t)(x - umul16w (y, y)) < 0) y--; // necessary
return y; // (int)sqrt(x)
}
int main (void)
{
uint32_t x = 0;
uint16_t res, ref;
do {
ref = (uint16_t)sqrt((double)x);
res = isqrt (x);
if (res != ref) {
printf ("!!!! x=%08x res=%08x ref=%08x\n", x, res, ref);
return EXIT_FAILURE;
}
x++;
} while (x);
return EXIT_SUCCESS;
}
Another possibility is to use the Newton iteration for the square root, despite the high cost of division. For small inputs only one iteration will be required. Although the asker did not state this, based on the execution time of 16 cycles for the DIV operation I strongly suspect that this is actually a 32/16->16 bit division which requires additional guard code to avoid overflow, defined as a quotient that does not fit into 16 bits. I have added appropriate safeguards to my code based on this assumption.
Since the Newton iteration doubles the number of good bits each time it is applied, we only need a low-precision initial guess which can easily be retrieved from a table based on the five leading bits of the argument. In order to grab these, we first normalize the argument into 2.30 fixed-point format with an additional implicit scale factor of 232-(lz & ~1) where lz are the number of leading zeros in the argument. As in the previous approach the iteration doesn't always deliver an accurate result, so a correction must be applied should the preliminary result be too big. I count 49 cycles for the fast path, 70 cycles for the slow path (average 60 cycles).
static const uint16_t sqrt_tab[32] =
{ 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
0x85ff, 0x8cff, 0x94ff, 0x9aff, 0xa1ff, 0xa7ff, 0xadff, 0xb3ff,
0xb9ff, 0xbeff, 0xc4ff, 0xc9ff, 0xceff, 0xd3ff, 0xd8ff, 0xdcff,
0xe1ff, 0xe6ff, 0xeaff, 0xeeff, 0xf3ff, 0xf7ff, 0xfbff, 0xffff
};
/* 32/16->16 bit division. Note: Will overflow if x[31:16] >= y */
uint16_t udiv_32_16 (uint32_t x, uint16_t y)
{
uint16_t r = x / y;
return r;
}
/* table lookup for initial guess followed by division-based Newton iteration*/
uint16_t isqrt (uint32_t x)
{
volatile uint16_t q, lz, y, i, xh;
if (x == 0) return x; // early out, code below can't handle zero
// initial guess based on leading 5 bits of argument normalized to 2.30
lz = clz (x);
i = ((x << (lz & ~1)) >> 27);
y = sqrt_tab[i] >> (lz >> 1);
xh = (x >> 16); // needed for overflow check on division
// first Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
if (lz < 10) {
// second Newton iteration, guard against overflow in division
q = 0xffff;
if (xh < y) q = udiv_32_16 (x, y);
y = (q + y) >> 1;
}
if (umul16w (y, y) > x) y--; // adjust quotient if too large
return y; // (int)sqrt(x)
}

I don't know how to turn it into an efficient algorithm but when I investigated this in the '80s an interesting pattern emerged. When rounding square roots, there are two more integers with that square root than the preceding one (after zero).
So, one number (zero) has a square root of zero, two have a square root of 1 (1 and 2), 4 have a square root of two (3, 4, 5 and 6) and so on. Probably not a useful answer but interesting nonetheless.

Here is a less incremental version of the technique #j_random_hacker described. On at least one processor it was just a bit faster when I fiddled with this a couple of years ago. I have no idea why.
// assumes unsigned is 32 bits
unsigned isqrt1(unsigned x) {
unsigned r = 0, r2 = 0;
for (int p = 15; p >= 0; --p) {
unsigned tr2 = r2 + (r << (p + 1)) + (1u << (p + p));
if (tr2 <= x) {
r2 = tr2;
r |= (1u << p);
}
}
return r;
}
/*
gcc 6.3 -O2
isqrt(unsigned int):
mov esi, 15
xor r9d, r9d
xor eax, eax
mov r8d, 1
.L3:
lea ecx, [rsi+1]
mov edx, eax
mov r10d, r8d
sal edx, cl
lea ecx, [rsi+rsi]
sal r10d, cl
add edx, r10d
add edx, r9d
cmp edx, edi
ja .L2
mov r11d, r8d
mov ecx, esi
mov r9d, edx
sal r11d, cl
or eax, r11d
.L2:
sub esi, 1
cmp esi, -1
jne .L3
rep ret
*/
If you turn up gcc 9 x86 optimization, it completely unrolls the loop and folds constants. The result is still only about 100 instructions.

How to optimize C code : looking for the next set bit and finding sum of corresponding array elements

EDIT: Now I realize I didn't explain my algorithm well enough. I'll try again.
What I'm doing is something very similar to dot product of two vectors, but there is a difference. I've got two vectors: one vector of bits and one vector of floats of the same length. So I need to calculate sum:
float[0]*bit[0]+float[1]*bit[1]+..+float[N-1]*bit[N-1], BUT the difference from a classic dot product is that I need to skip some fixed number of elements after each set bit.
Example:
vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits = {1, 0, 1, 0, 1 }
nSkip = 2
in this case sum is calculated as follows:
sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
The following code is called many times with different data so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some extent but still slow, but I don't know how to further improve it.
Bit array is implemented as an array of 64 bit unsigned ints, it allows me to use bitscanforward to find the next set bit.
code:
unsigned int i = 0;
float fSum = 0;
do
{
unsigned int nAddr = i / 64;
unsigned int nShift = i & 63;
unsigned __int64 v = bitarray[nAddr] >> nShift;
unsigned long idx;
if (!_BitScanForward64(&idx, v))
{
i+=64-nShift;
continue;
}
i+= idx;
fSum += floatarray[i];
i+= nSkip;
} while(i<nEnd);
Profiler shows 3 slowest hotspots :
1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v)
3. fSum += floatarray[i]; (memory access)
But probably there is a different way of doing this. I was thinking about just resetting nSkip bits after each set bit in the bit vector and then calculating classical dot product - didn't try yet but honestly don't belive it will be faster with more memory access.

You have too many of your operations inside of the loop. You also only have one loop, so many of the operations that do need to happen for each flag word (the 64 bit unsigned integer) are happening 63 extra times.
Consider division an expensive operation and try to not do that too often when optimizing code for performance.
Memory access is also considered expensive in terms of how long it takes, so this should also be limited to required accesses only.
Tests that allow you to exit early are often useful (though sometimes the test itself is expensive relative to the operations you'd be avoiding, but that's probably not the case here.
Using nested loops should simplify this a lot. The outer loop should work at the 64 bit word level, and the inner loop should work at the bit level.
I have noticed a mistake in my earlier recommendations. Since the division here is by 64, which is a power of 2, this is not actually an expensive operation, but we still need to get as many operations as far out of the loops as we can.
/* this is completely untested, but incorporates the optimizations
that I outlined as well as a few others.
I process the arrays backwards, which allows for elimination of
comparisons of variables against other variables, which is much
slower than comparisons of variables against 0, which is essentially
free on many processors when you have just operated or loaded the
value to a register.
Going backwards at the bit level also allows for the possibility that
the compiler will take advantage of the comparison of the top bit
being the same as test for negative, which is cheap and mostly free
for all but the first time through the inner loop (for each time
through the outer loop.
*/
double acc = 0.0;
unsigned i_end = nEnd-1;
unsigned i_bit;
int i_word_end;
if (i_end == 0)
{
return acc;
}
i_bit = i_end % 64;
i_word = i_end / 64;
do
{
unsigned __int64 v = bitarray[i_word_end];
unsigned i_upper = i_word_end << 64;
while (v)
{
if (v & 0x80000000000000)
{
// The following code is semantically the same as
// unsigned i = i_bit_end + (i_word_end * sizeof(v));
unsigned i = i_bit_end | i_upper;
acc += floatarray[i];
}
v <<= 1;
i--;
}
i_bit_end = 63;
i_word_end--;
} while (i_word_end >= 0);

I think you should check "how to ask questions" first. You will not gain many upvotes for this, since you are asking us to do the work for you instead of introducing a particular problem.
I cannot see why you are incrementing the same variable in two places instead of one (i).
Also think you should declare variables only once, not in every iteration.

Removing slow int64 division from fixed point atan2() approximation

I made a function to compute a fixed-point approximation of atan2(y, x). The problem is that of the ~83 cycles it takes to run the whole function, 70 cycles (compiling with gcc 4.9.1 mingw-w64 -O3 on an AMD FX-6100) are taken entirely by a simple 64-bit integer division! And sadly none of the terms of that division are constant. Can I speed up the division itself? Is there any way I can remove it?
I think I need this division because since I approximate atan2(y, x) with a 1D lookup table I need to normalise the distance of the point represented by x,y to something like a unit circle or unit square (I chose a unit 'diamond' which is a unit square rotated by 45°, which gives a pretty even precision across the positive quadrant). So the division finds (|y|-|x|) / (|y|+|x|). Note that the divisor is in 32-bits while the numerator is a 32-bit number shifted 29 bits right so that the result of the division has 29 fractional bits. Also using floating point division is not an option as this function is required not to use floating point arithmetic.
Any ideas? I can't think of anything to improve this (and I can't figure out why it takes 70 cycles just for a division). Here's the full function for reference:
int32_t fpatan2(int32_t y, int32_t x) // does the equivalent of atan2(y, x)/2pi, y and x are integers, not fixed point
{
#include "fpatan.h" // includes the atan LUT as generated by tablegen.exe, the entry bit precision (prec), LUT size power (lutsp) and how many max bits |b-a| takes (abdp)
const uint32_t outfmt = 32; // final output format in s0.outfmt
const uint32_t ofs=30-outfmt, ds=29, ish=ds-lutsp, ip=30-prec, tp=30+abdp-prec, tmask = (1<<ish)-1, tbd=(ish-tp); // ds is the division shift, the shift for the index, bit precision of the interpolation, the mask, the precision for t and how to shift from p to t
const uint32_t halfof = 1UL<<(outfmt-1); // represents 0.5 in the output format, which since it is in turns means half a circle
const uint32_t pds=ds-lutsp; // division shift and post-division shift
uint32_t lutind, p, t, d;
int32_t a, b, xa, ya, xs, ys, div, r;
xs = x >> 31; // equivalent of fabs()
xa = (x^xs) - xs;
ys = y >> 31;
ya = (y^ys) - ys;
d = ya+xa;
if (d==0) // if both y and x are 0 then they add up to 0 and we must return 0
return 0;
// the following does 0.5 * (1. - (y-x) / (y+x))
// (y+x) is u1.31, (y-x) is s0.31, div is in s1.29
div = ((int64_t) (ya-xa)<<ds) / d; // '/d' normalises distance to the unit diamond, immediate result of division is always <= +/-1^ds
p = ((1UL<<ds) - div) >> 1; // before shift the format is s2.29. position in u1.29
lutind = p >> ish; // index for the LUT
t = (p & tmask) >> tbd; // interpolator between two LUT entries
a = fpatan_lut[lutind];
b = fpatan_lut[lutind+1];
r = (((b-a) * (int32_t) t) >> abdp) + (a<<ip); // linear interpolation of a and b by t in s0.32 format
// Quadrants
if (xs) // if x was negative
r = halfof - r; // r = 0.5 - r
r = (r^ys) - ys; // if y was negative then r is negated
return r;
}

Unfortunately a 70 cycles latency is typical for a 64-bit integer division on x86 CPUs. Floating point division typically has about half the latency or less. The increased cost comes from the fact modern CPUs only have dividers in their floating point execution units (they're very expensive in terms silicon area), so need to convert the integers to floating point and back again. So just substituting a floating division in place of the integer one isn't likely to help. You'll need to refactor your code to use floating point instead to take advantage of faster floating point division.
If you're able to refactor your code you might also be able to benefit from the approximate floating-point reciprocal instruction RCPSS, if you don't need an exact answer. It has a latency of around 5 cycles.

Based on #Iwillnotexist Idonotexist's suggestion to use lzcnt, reciprocity and multiplication I implemented a division function that runs in about 23.3 cycles and with a pretty great precision of 1 part in 19 million with a 1.5 kB LUT, e.g. one of the worst cases being for 1428769848 / 1080138864 you might get 1.3227648959 instead of 1.3227649663.
I figured out an interesting technique while researching this, I was really struggling to think of something that could be fast and precise enough, as not even a quadratic approximation of 1/x in [0.5 , 1.0) combined with an interpolated difference LUT would do, then I had the idea of doing it the other way around so I made a lookup table that contains the quadratic coefficients that fit the curve on a short segment that represents 1/128th of the [0.5 , 1.0) curve, which gives you a very small error like so. And using the 7 most significant bits of what represents x in the [0.5 , 1.0) range as a LUT index I directly get the coefficients that work best for the segment that x falls into.
Here's the full code with the lookup tables ffo_lut.h and fpdiv.h:
#include "ffo_lut.h"
static INLINE int32_t log2_ffo32(uint32_t x) // returns the number of bits up to the most significant set bit so that 2^return > x >= 2^(return-1)
{
int32_t y;
y = x>>21; if (y) return ffo_lut[y]+21;
y = x>>10; if (y) return ffo_lut[y]+10;
return ffo_lut[x];
}
// Usage note: for fixed point inputs make outfmt = desired format + format of x - format of y
// The caller must make sure not to divide by 0. Division by 0 causes a crash by negative index table lookup
static INLINE int64_t fpdiv(int32_t y, int32_t x, int32_t outfmt) // ~23.3 cycles, max error (by division) 53.39e-9
{
#include "fpdiv.h" // includes the quadratic coefficients LUT (1.5 kB) as generated by tablegen.exe, the format (prec=27) and LUT size power (lutsp)
const int32_t *c;
int32_t xa, xs, p, sh;
uint32_t expon, frx, lutind;
const uint32_t ish = prec-lutsp-1, cfs = 31-prec, half = 1L<<(prec-1); // the shift for the index, the shift for 31-bit xa, the value of 0.5
int64_t out;
int64_t c0, c1, c2;
// turn x into xa (|x|) and sign of x (xs)
xs = x >> 31;
xa = (x^xs) - xs;
// decompose |x| into frx * 2^expon
expon = log2_ffo32(xa);
frx = (xa << (31-expon)) >> cfs; // the fractional part is now in 0.27 format
// lookup the 3 quadratic coefficients for c2*x^2 + c1*x + c0 then compute the result
lutind = (frx - half) >> ish; // range becomes [0, 2^26 - 1], in other words 0.26, then >> (26-lutsp) so the index is lutsp bits
lutind *= 3; // 3 entries for each index
c = &fpdiv_lut[lutind]; // c points to the correct c0, c1, c2
c0 = c[0]; c1 = c[1]; c2 = c[2];
p = (int64_t) frx * frx >> prec; // x^2
p = c2 * p >> prec; // c2 * x^2
p += c1 * frx >> prec; // + c1 * x
p += c0; // + c0, p = (1.0 , 2.0] in 2.27 format
// apply the necessary bit shifts and reapplies the original sign of x to make final result
sh = expon + prec - outfmt; // calculates the final needed shift
out = (int64_t) y * p; // format is s31 + 1.27 = s32.27
if (sh >= 0)
out >>= sh;
else
out <<= -sh;
out = (out^xs) - xs; // if x was negative then out is negated
return out;
}
I think ~23.3 cycles is about as good as it's gonna get for what it does, but if you have any ideas to shave a few cycles off please let me know.
As for the fpatan2() question the solution would be to replace this line:
div = ((int64_t) (ya-xa)<<ds) / d;
with that line:
div = fpdiv(ya-xa, d, ds);

Yours time hog instruction:
div = ((int64_t) (ya-xa)<<ds) / d;
exposes at least two issues. The first one is that you mask the builtin div function; but this is minor fact, could be never observed. The second one is that first, according to C language rules, both operands are converted to common type which is int64_t, and, then, division for this type is expanded into CPU instruction which divides 128-bit dividend by 64-bit divisor(!) Extract from assembly of cut-down version of your function:
21: 48 89 c2 mov %rax,%rdx
24: 48 c1 fa 3f sar $0x3f,%rdx ## this is sign bit extension
28: 48 f7 fe idiv %rsi
Yep, this division requires about 70 cycles and can't be optimized (well, really it can, but e.g. reverse divisor approach requires multiplication with 192-bit product). But if you are sure this division can be done with 64-bit dividend and 32-bit divisor and it won't overflow (quotient will fit into 32 bits) (I agree because ya-xa is always less by absolute value than ya+xa), this can be sped up using explicit assembly request:
uint64_t tmp_num = ((int64_t) (ya-xa))<<ds;
asm("idivl %[d]" :
[a] "=a" (div1) :
"[a]" (tmp_num), "d" (tmp_num >> 32), [d] "q" (d) :
"cc");
this is quick&dirty and shall be carefully verified, but I hope the idea is understood. The resulting assembly now looks like:
18: 48 98 cltq
1a: 48 c1 e0 1d shl $0x1d,%rax
1e: 48 89 c2 mov %rax,%rdx
21: 48 c1 ea 20 shr $0x20,%rdx
27: f7 f9 idiv %ecx
This seems to be huge advance because 64/32 division requires up to 25 clock cycles on Core family, according to Intel optimization manual, instead of 70 you see for 128/64 division.
More minor approvements can be added; e.g. shifts can be done yet more economically in parallel:
uint32_t diff = ya - xa;
uint32_t lowpart = diff << 29;
uint32_t highpart = diff >> 3;
asm("idivl %[d]" :
[a] "=a" (div1) :
"[a]" (lowpart), "d" (highpart), [d] "q" (d) :
"cc");
which results in:
18: 89 d0 mov %edx,%eax
1a: c1 e0 1d shl $0x1d,%eax
1d: c1 ea 03 shr $0x3,%edx
22: f7 f9 idiv %ecx
but this is minor fix, compared to the division-related one.
To conclude, I really doubt this routine is worth to be implemented in C language. The latter is quite ineconomical in integer arithmetic, requiring useless expansions and high part losses. The whole routine is worth to be moved to assembler.

Given an fpatan() implementation, you could simply implement fpatan2() in terms of that.
Assuming constants defined for pi abd pi/2:
int32_t fpatan2( int32_t y, int32_t x)
{
fixed theta ;
if( x == 0 )
{
theta = y > 0 ? fixed_half_pi : -fixed_half_pi ;
}
else
{
theta = fpatan( y / x ) ;
if( x < 0 )
{
theta += ( y < 0 ) ? -fixed_pi : fixed_pi ;
}
}
return theta ;
}
Note that fixed library implementations are easy to get very wrong. You might take a look at Optimizing Math-Intensive Applications with Fixed-Point Arithmetic. The use of C++ in the library under discussion makes the code much simpler, in most cases you can just replace the float or double keyword with fixed. It does not however have an atan2() implementation, the code above is adapted from my implementation for that library.

How to properly add/subtract a 128-bit number (as two uint64_t)?

I'm working in C and need to add and subtract a 64-bit number and a 128-bit number. The result will be held in the 128-bit number. I am using an integer array to store the upper and lower halves of the 128-bit number (i.e. uint64_t bigNum[2], where bigNum[0] is the least significant).
Can anybody help with an addition and subtraction function that can take in bigNum and add/subtract a uint64_t to it?
I have seen many incorrect examples on the web, so consider this:
bigNum[0] = 0;
bigNum[1] = 1;
subtract(&bigNum, 1);
At this point bigNum[0] should have all bits set, while bigNum[1] should have no bits set.

In many architectures it's very easy to add/subtract any arbitrarily-long integers because there's a carry flag and add/sub-with-flag instruction. For example on x86 rdx:rax += r8:r9 can be done like this
add rax, r9 # add the low parts and store the carry
adc rdx, r8 # add the high parts with carry
In C there's no way to access this carry flag so you must calculate the flag on your own. The easiest way is to check if the unsigned sum is less than either of the operand like this. For example to do a += b we'll do
aL += bL;
aH += bH + (aL < bL);
This is exactly how multi-word add is done in architectures that don't have a flag register. For example in MIPS it's done like this
# alow = blow + clow
addu alow, blow, clow
# set tmp = 1 if alow < clow, else 0
sltu tmp, alow, clow
addu ahigh, bhigh, chigh
addu ahigh, ahigh, tmp
Here's some example assembly output

This should work for the subtraction:
typedef u_int64_t bigNum[2];
void subtract(bigNum *a, u_int64_t b)
{
const u_int64_t borrow = b > a[1];
a[1] -= b;
a[0] -= borrow;
}
Addition is very similar. The above could of course be expressed with an explicit test, too, but I find it cleaner to always do the borrowing. Optimization left as an exercise.
For a bigNum equal to { 0, 1 }, subtracting two would make it equal { ~0UL, ~0UL }, which is the proper bit pattern to represent -1. Here, UL is assumed to promote an integer to 64 bits, which is compiler-dependent of course.

In grade 1 or 2, you should have learn't how to break down the addition of 1 and 10 into parts, by splitting it into multiple separate additions of tens and units. When dealing with big numbers, the same principals can be applied to compute arithmetic operations on arbitrarily large numbers, by realizing your units are now units of 2^bits, your "tens" are 2^bits larger and so on.

For the case the value that your are subtracting is less or equal to bignum[0] you don't have to touch bignum[1].
If it isn't, you subtract it from bignum[0], anyhow. This operation will wrap around, but this is the behavior you need here. In addition you'd then have to substact 1 from bignum[1].

Most compilers support a __int128 type intrinsically.
Try it and you might be lucky.

speeding up "base conversion" for large integers

I am using a base-conversion algorithm to generate a permutation from a large integer (split into 32-bit words).
I use a relatively standard algorithm for this:
/* N = count,K is permutation index (0..N!-1) A[N] contains 0..N-1 */
i = 0;
while (N > 1) {
swap A[i] and A[i+(k%N)]
k = k / N
N = N - 1
i = i + 1
}
Unfortunately, the divide and modulo each iteration adds up, especially moving to large integers - But, it seems I could just use multiply!
/* As before, N is count, K is index, A[N] contains 0..N-1 */
/* Split is arbitrarily 128 (bits), for my current choice of N */
/* "Adjust" is precalculated: (1 << Split)/(N!) */
a = k*Adjust; /* a can be treated as a fixed point fraction */
i = 0;
while (N > 1) {
a = a*N;
index = a >> Split;
a = a & ((1 << Split) - 1); /* actually, just zeroing a register */
swap A[i] and A[i+index]
N = N - 1
i = i + 1
}
This is nicer, but doing large integer multiplies is still sluggish.
Question 1:
Is there a way of doing this faster?
Eg. Since I know that N*(N-1) is less than 2^32, could I pull out those numbers from one word, and merge in the 'leftovers'?
Or, is there a way to modify an arithetic decoder to pull out the indicies one at a time?
Question 2:
For the sake of curiosity - if I use multiplication to convert a number to base 10 without the adjustment, then the result is multiplied by (10^digits/2^shift). Is there a tricky way to remove this factor working with the decimal digits? Even with the adjustment factor, this seems like it would be faster -- why wouldn't standard libraries use this vs divide and mod?

Seeing that you are talking about numbers like 2^128/(N!), it seems that in your problem N is going to be rather small (N < 35 according to my calculations).
I suggest taking the original algorithm as a starting point; first switch the direction of the loop:
i = 2;
while (i < N) {
swap A[N - 1 - i] and A[N - i + k % i]
k = k / i
i = i + 1
}
Now change the loop to do several permutations per iteration. I guess the speed of division is the same regardless of the number i, as long as i < 2^32.
Split the range 2...N-1 into sub-ranges so that the product of the numbers in each sub-range is less than 2^32:
2, 3, 4, ..., 12: product is 479001600
13, 14, ..., 19: product is 253955520
20, 21, ..., 26: product is 3315312000
27, 28, ..., 32: product is 652458240
33, 34, 35: product is 39270
Then, divide the long number k by the products instead of dividing by i. Each iteration will yield a remainder (less than 2^32) and a smaller number k. When you have the remainder, you can work with it in an inner loop using the original algorithm; which will now be faster because it doesn't involve long division.
Here is some code:
static const int rangeCount = 5;
static const int rangeLimit[rangeCount] = {13, 20, 27, 33, 36};
static uint32_t rangeProduct[rangeCount] = {
479001600,
253955520,
3315312000,
652458240,
39270
};
for (int rangeIndex = 0; rangeIndex < rangeCount; ++rangeIndex)
{
// The following two lines involve long division;
// math libraries probably calculate both quotient and remainder
// in one function call
uint32_t rangeRemainder = k % rangeProduct[rangeIndex];
k /= rangeProduct[rangeIndex];
// A range starts where the previous range ended
int rangeStart = (rangeIndex == 0) ? 2 : rangeLimit[rangeIndex - 1];
// Iterate over range
for (int i = rangeStart; i < rangeLimit[rangeIndex] && i < n; ++i)
{
// The following two lines involve a 32-bit division;
// it produces both quotient and remainder in one Pentium instruction
int remainder = rangeRemainder % i;
rangeRemainder /= i;
std::swap(permutation[n - 1 - i], permutation[n - i + remainder]);
}
}
Of course, this code can be extended into more than 128 bits.
Another optimization could involve extraction of powers of 2 from the products of ranges; this might add a slight speedup by making the ranges longer. Not sure whether this is worthwhile (maybe for large values of N, like N=1000).

Dont know about algorithms, but the ones you use seems pretty simple, so i dont really see how you can optimize the algorithm.
You may use alternative approaches:
use ASM (assembler) - from my experience, after a long time trying to figure out how should a certain algorithm would be written in ASM, it ended up being slower than the version generated by the compiler:) Probably because the compiler also knows how to layout the code so the CPU cache would be more efficient, and/or what instructions are actually faster and what situations(this was on GCC/linux).
use multi-processing:
make your algorithm multithreaded, and make sure you run with the same number of threads as the number of available cpu cores(most cpu's nowdays do have multiple cores/multithreading)
make you algorithm capable of running on multiple machines on a network, and devise a way of sending these numbers to machines in a network, so you may use their CPU power.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Find the INDEX of element having max. absolute value using AVX512 instructions - c

Related

Fastest Integer Square Root in the least amount of instructions

How to optimize C code : looking for the next set bit and finding sum of corresponding array elements

Removing slow int64 division from fixed point atan2() approximation

How to properly add/subtract a 128-bit number (as two uint64_t)?

speeding up "base conversion" for large integers

Categories

Resources