Where can I find soft-multiply and divide algorithms?

I'm working on a micro-controller without hardware multiply and divide. I need to cook up software algorithms for these basic operations that are a nice balance of compact size and efficiency. My C compiler port will employ these algorithms, not the C developers themselves.
My google-fu is so far turning up mostly noise on this topic.
Can anyone point me to something informative? I can use add/sub and shift instructions. Table lookup based algos might also work for me, but I'm a bit worried about cramming so much into the compiler's back-end...um, so to speak.

Here's a simple multiplication algorithm (a C sketch follows the steps):
1. Start with the rightmost bit of the multiplier.
2. If that bit of the multiplier is 1, add the multiplicand to the result.
3. Shift the multiplicand left by 1.
4. Move to the next bit of the multiplier and go back to step 2.
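A minimal C sketch of those steps, assuming unsigned 16x16 -> 32 bit operands (the widths and names are just for illustration):

#include <stdint.h>

/* Shift-and-add multiply: examine each multiplier bit, adding the
   (progressively shifted) multiplicand whenever the bit is set. */
uint32_t soft_mul16(uint16_t multiplicand, uint16_t multiplier)
{
    uint32_t product = 0;
    uint32_t mcand = multiplicand;      /* widened so the shifts can't overflow */

    for (int bit = 0; bit < 16; bit++) {
        if (multiplier & 1u)            /* step 2: bit set, so add the multiplicand */
            product += mcand;
        mcand <<= 1;                    /* step 3: shift the multiplicand left */
        multiplier >>= 1;               /* step 4: move to the next multiplier bit */
    }
    return product;
}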
And here's a division algorithm (a C sketch follows the steps):
1. If the divisor is larger than the dividend, stop.
2. While the divisor register is less than the dividend register, shift it left.
3. Shift the divisor register right by 1.
4. Subtract the divisor register from the dividend register, and set the bit in the result register that corresponds to the total number of shifts applied to the divisor register.
5. Start over at step 1 with the divisor register in its original state.
Of course you'll need to put in a check for dividing by 0, but it should work.
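A minimal C sketch of shift-and-subtract (restoring) division in the same spirit, again with illustrative 16-bit unsigned operands; it tries the divisor at every alignment from the top bit down rather than re-aligning on each pass:

#include <stdint.h>

/* Shift-and-subtract divide: try the shifted divisor at each bit position,
   from the highest down, and set a quotient bit whenever it fits. */
uint16_t soft_div16(uint16_t dividend, uint16_t divisor, uint16_t *remainder)
{
    uint16_t quotient = 0;

    if (divisor == 0) {                          /* the divide-by-zero check from the text */
        if (remainder) *remainder = dividend;
        return 0xFFFFu;                          /* or trap, whatever the platform prefers */
    }

    for (int shift = 15; shift >= 0; shift--) {
        uint32_t d = (uint32_t)divisor << shift; /* widened so the shift can't overflow */
        if (d <= dividend) {
            dividend -= (uint16_t)d;
            quotient |= (uint16_t)(1u << shift);
        }
    }

    if (remainder) *remainder = dividend;        /* whatever is left is the remainder */
    return quotient;
}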
These algorithms, of course, are only for integers.

My favorite reference for things like this, available in book form:
http://www.hackersdelight.org/
Also you can't go wrong with TAoCP: http://www-cs-faculty.stanford.edu/~uno/taocp.html

Here's a division algorithm: http://www.prasannatech.net/2009/01/division-without-division-operator_24.html
I assume we're talking about ints?
If there's no hardware support, you'll have to implement your own divide-by-zero exception.
(I didn't have much luck quickly finding a multiplication algorithm, but I'll keep looking if someone else doesn't find one).

One simple and fairly performant multiplication algorithm for integers is Russian Peasant Multiplication (a short sketch follows below).
For rationals, you could try a binary quote notation, for which division is easier than usual.
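As a quick illustration (unsigned 32-bit operands assumed), Russian Peasant Multiplication halves one operand while doubling the other, accumulating the doubled value whenever the halved one is odd:

#include <stdint.h>

/* Russian Peasant Multiplication: repeatedly halve b and double a,
   adding a to the result whenever b is odd. */
uint32_t peasant_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u)          /* b is odd: accumulate the current a */
            result += a;
        a <<= 1;             /* double a */
        b >>= 1;             /* halve b */
    }
    return result;
}

It terminates as soon as the remaining multiplier bits are all zero, which is why it tends to be compact and quick for small multipliers.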

It turns out that I still have some old 68000 assembler code for long multiplication and long division. 68000 code is pretty clean and simple, so should be easy to translate for your chip.
The 68000 had multiply and divide instructions IIRC - I think these were written as a learning exercise.
Decided to just put the code here. Added comments and, in the process, fixed a problem.
;
; Purpose : division of longword by longword to give longword
; : all values signed.
; Requires : d0.L == Value to divide
; : d1.L == Value to divide by
; Changes : d0.L == Remainder
; : d2.L == Result
; : corrupts d1, d3, d4
;
section text
ldiv: move #0,d3 ; Convert d0 -ve to +ve - d3 records original sign
tst.l d0
bpl.s lib5a
neg.l d0
not d3
lib5a: tst.l d1 ; Convert d1 -ve to +ve - d3 records result sign
bpl.s lib5b
neg.l d1
not d3
lib5b: tst.l d1 ; Detect division by zero (not really handled well)
bne.s lib3a
rts
lib3a: moveq.l #0,d2 ; Init working result d2
moveq.l #1,d4 ; Init d4
lib3b: cmp.l d0,d1 ; while d0 < d1 {
bhi.s lib3c
asl.l #1,d1 ; double d1 and d4
asl.l #1,d4
bra.s lib3b ; }
lib3c: asr.l #1,d1 ; halve d1 and d4
asr.l #1,d4
bcs.s lib3d ; stop when d4 reaches zero
cmp.l d0,d1 ; do subtraction if appropriate
bhi.s lib3c
or.l d4,d2 ; update result
sub.l d1,d0
bne.s lib3c
lib3d: ; fix the result and remainder signs
; and.l #$7fffffff,d2 ; don't know why this is here
tst d3
beq.s lib3e
neg.l d2
neg.l d0
lib3e: rts
;
; Purpose : Multiply long by long to give long
; Requires : D0.L == Input 1
; : D1.L == Input 2
; Changes : D2.L == Result
; : D3.L is corrupted
;
lmul: move #0,d3 ; d0 -ve to +ve, original sign in d3
tst.l d0
bpl.s lib4c
neg.l d0
not d3
lib4c: tst.l d1 ; d1 -ve to +ve, result sign in d3
bpl.s lib4d
neg.l d1
not d3
lib4d: moveq.l #0,d2 ; init d2 as working result
lib4a: asr.l #1,d0 ; shift d0 right
bcs.s lib4b ; if a bit fell off, update result
asl.l #1,d1 ; either way, shift left d1
tst.l d0
bne.s lib4a ; if d0 non-zero, continue
tst.l d3 ; basically done - apply sign?
beq.s lib4e ; was broken! now fixed
neg.l d2
lib4e: rts
lib4b: add.l d1,d2 ; main loop body - update result
asl.l #1,d1
bra.s lib4a
By the way - I never did figure out whether it was necessary to convert everything to positive at the start. If you're careful with the shift operations, that may be avoidable overhead.

To multiply, add partial products from the shifted multiplicand to an accumulator iff the corresponding bit in the multiplier is set. Shift multiplicand and multiplier at end of loop, testing multiplier & 1 to see if addition should be done.

The Microchip PICmicro 16Fxxx series chips do not have a multiply or divide instruction.
Perhaps some of the soft multiply and soft divide routines for it can be ported to your MCU.
PIC Microcontroller Basic Math Multiplication Methods
PIC Microcontroller Basic Math Division Methods
Also check out "Newton's method" for division.
I think that method gives the smallest executable size of any division algorithm I've ever seen, although the explanation makes it sound more complicated than it really is.
I hear that some early Cray supercomputers used Newton's method for division.
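To illustrate the principle only (this is not the PIC routine linked above): Newton's method refines an estimate x of 1/d with x <- x*(2 - d*x), roughly doubling the number of correct bits each step. On an integer-only MCU you would run the same iteration in fixed point; the floating-point sketch below just shows the core loop, and assumes the caller has normalised d into [0.5, 1).

/* Newton-Raphson reciprocal: x <- x*(2 - d*x); the error squares every step.
   Assumes d has been normalised into [0.5, 1) (the caller undoes the scaling
   afterwards); the linear seed below has at most ~1/17 relative error, so
   three iterations are plenty for single precision. n/d is then n * x. */
float nr_reciprocal(float d)
{
    float x = 48.0f / 17.0f - (32.0f / 17.0f) * d;   /* classic initial estimate */
    for (int i = 0; i < 3; i++)
        x = x * (2.0f - d * x);
    return x;
}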

Related

Floating point numbers and the effect on 8-bit microcontrollers memory

I am currently working on a project that includes bare-metal programming on an STM8 micro-controller using the SDCC compiler in Linux. The memory in the chip is quite low, so I'm trying to keep things really lean. I have gotten by with using 8-bit and 16-bit variables and things have gone well. But recently I ran into a problem where I really needed a float variable. So I wrote a function that takes in a 16-bit value, converts it to a float, does the math I need, and returns an 8-bit number. This caused my final compiled code on the MCU to go from 1198 bytes to 3462 bytes. Now I understand that using floating point is memory intensive and that many functions may need to be called to handle the use of the floating-point number, but it seems crazy to increase the size of the program by that much. I would like some help understanding why this is and what happened exactly.
Specs: MCU stm8151f2
Compiler: SDCC with --opt_code_size option
#include <stdint.h>

// ADC_MIN is defined elsewhere in the project
int roundNo(uint16_t bit_input)
{
    float num = (((float)bit_input) - ADC_MIN) / 124.0;
    return num < 0 ? num - 0.5 : num + 0.5;
}
To determine why the code is so large on your particular tool chain, you would need to look at the generated assembly code, and see what FP support calls it makes, then look at the map file to determine the size of each of those functions.
As an example, on Godbolt for AVR using GCC 5.4.0 with -Os (Godbolt does not support STM8 or SDCC, so this is for comparison as an 8-bit architecture), your code generates 6364 bytes compared to 4081 bytes for an empty function. So the additional code required for the code body is 2283 bytes. Now, accounting for the fact that you are using both a different compiler and architecture, these are not that different from your results. See in the generated code (below) the rcalls to subroutines such as __divsf3 - these are where the bulk of the code will be, and I suspect FP division is by far the largest contributor.
roundNo(unsigned int):
push r12
push r13
push r14
push r15
mov r22,r24
mov r23,r25
ldi r24,0
ldi r25,0
rcall __floatunsisf
ldi r18,0
ldi r19,0
ldi r20,0
ldi r21,lo8(69)
rcall __subsf3
ldi r18,0
ldi r19,0
ldi r20,lo8(-8)
ldi r21,lo8(66)
rcall __divsf3
mov r12,r22
mov r13,r23
mov r14,r24
mov r15,r25
ldi r18,0
ldi r19,0
ldi r20,0
ldi r21,0
rcall __ltsf2
ldi r18,0
ldi r19,0
ldi r20,0
ldi r21,lo8(63)
sbrs r24,7
rjmp .L6
mov r25,r15
mov r24,r14
mov r23,r13
mov r22,r12
rcall __subsf3
rjmp .L7
.L6:
mov r25,r15
mov r24,r14
mov r23,r13
mov r22,r12
rcall __addsf3
.L7:
rcall __fixsfsi
mov r24,r22
mov r25,r23
pop r15
pop r14
pop r13
pop r12
ret
You need to perform the same analysis on the code generated by your tool chain to answer your question. No doubt SDCC is capable of generating an assembly listing and a map file which will allow you to determine exactly what code and FP support is being generated and linked.
Ultimately though your use of FP in this case is entirely unnecessary:
int roundNo(uint16_t bit_input)
{
    int s = (bit_input - ADC_MIN);
    s += s < 0 ? -62 : 62;
    return s / 124;
}
On Godbolt this comes to 2283 bytes compared to an empty function. Still somewhat large, but the issue there is most likely that the AVR lacks a DIV instruction, so it calls __divmodhi4. STM8 has a DIV for a 16-bit dividend and 8-bit divisor, so it will likely be significantly smaller (and faster) on your target.
OK, a version of fixed point that actually works:
#include <stdint.h>
#include <stdio.h>

// Assume a 28.4 format for the math. 12.4 can be used, but round-off may occur.
// Input should be a literal float. (Note that the multiply here is handled by the
// compiler at compile time and does not generate FP asm code.)
#define TO_FIXED(x) (int)((x) * 16)
// Takes a fixed and converts to an int - should turn into a right shift by 4.
#define TO_INT(x)   (int)((x) / 16)

typedef int FIXED;
const uint16_t ADC_MIN = 32768;

int roundNo(uint16_t bit_input)
{
    FIXED num = TO_FIXED(bit_input - ADC_MIN) / 124;
    num += num < 0 ? TO_FIXED(-0.5) : TO_FIXED(0.5);
    return TO_INT(num);
}

int main()
{
    printf("%d", roundNo(0));
    return 0;
}
Note that we are using some 32-bit values here, so it will be bigger than your current values. With care, though, it could possibly be converted back to a 12.4 format (16-bit int) if round-off and overflow can be managed carefully.
Or go grab a better full feature Fixed Point library from the web :)
(Update) After writing this, I noticed that @Clifford mentioned that your microcontroller supports the DIV instruction natively, in which case doing this is redundant. Anyway, I will leave it as a concept which can be applied in cases where DIV is implemented as an extern call, or for cases where DIV takes too many cycles and the goal is to make the calculation faster.
Anyway, shifting and adding is likely to be faster than division, if you ever need to squeeze some extra cycles. So if you start from the fact that 124 is almost equal to 4096/33 (the error factor is 0.00098, i.e. 0.098%, so less than 1 in 1000), you can implement the division with a single multiplication with 33 and a shift by 12 bits (division by 4096). Furthermore, 33 is 32+1, meaning multiplying by 33 is equal to shifting left by 5 and adding the input again.
Example: you want to divide 5000 by 124, and 5000/124 is approx. 40.323. What we will be doing is:
5,000 << 5 = 160,000
160,000 + 5,000 = 165,000
165,000 >> 12 = 40
Note that this only works for positive numbers. Also note that, if you're really doing lots of multiplications all over the code, then having a single extern mul or div function might result in smaller overall code in the long run, especially if the compiler is not particularly good at optimizing. And if the compiler can just emit a DIV instruction here, then the only thing you can get is a tiny bit of speed improvement, so don't bother with this.
#include <stdint.h>

#define ADC_MIN 2048

uint16_t roundNo(uint16_t bit_input)
{
    // input too low, return zero
    if (bit_input < ADC_MIN)
        return 0;
    bit_input -= (ADC_MIN - 62);
    uint32_t x = bit_input;
    // this gets us x = x * 33
    x <<= 5;
    x += bit_input;
    // this gets us x = x / 4096
    x >>= 12;
    return (uint16_t)x;
}
GCC AVR with size optimizations produces this, i.e. all calls to extern mul or div functions are gone, but it seems like AVR doesn't support shifting multiple bits in a single instruction (it emits loops which shift 5 times and 12 times respectively). I don't have a clue what your compiler will do.
If you also need to handle the bit_input < ADC_MIN case, I would handle this part separately, i.e.:
#include <stdint.h>
#include <stdbool.h>

#define ADC_MIN 2048

int16_t roundNo(uint16_t bit_input)
{
    // if subtraction would result in a negative value,
    // handle it properly
    bool negative = (bit_input < ADC_MIN);
    bit_input = negative ? (ADC_MIN - bit_input) : (bit_input - ADC_MIN);
    // we are always positive from this point on; add the rounding offset (62 = 124/2)
    bit_input += 62;
    uint32_t x = bit_input;
    x <<= 5;
    x += bit_input;
    x >>= 12;
    return negative ? -(int16_t)x : (int16_t)x;
}

Handling zeroes in _mm256_rsqrt_ps()

Given that _mm256_sqrt_ps() is relatively slow, and that the values I am generating are immediately truncated with _mm256_floor_ps(), looking around it seems that doing:
_mm256_mul_ps(_mm256_rsqrt_ps(eightFloats),
eightFloats);
Is the way to go for that extra bit of performance and avoiding a pipeline stall.
Unfortunately, with zero values, I of course get a crash calculating 1/sqrt(0). What is the best way around this? I have tried this (which works and is faster), but is there a better way, or am I going to run into problems under certain conditions?
_mm256_mul_ps(_mm256_rsqrt_ps(_mm256_max_ps(eightFloats,
_mm256_set1_ps(0.1))),
eightFloats);
My code is for a vertical application, so I can assume that it will be running on a Haswell CPU (i7-4810MQ), so FMA/AVX2 can be used. The original code is approximately:
float vals[MAX];
int sum = 0;
for (int i = 0; i < MAX; i++)
{
    int thisSqrt = (int) floor(sqrt(vals[i]));
    sum += min(thisSqrt, 0x3F);
}
All the values of vals should be integer values. (Why everything isn't just int is a different question...)
tl;dr: See the end for code that compiles and should work.
To just solve the 0.0 problem, you could also special-case inputs of 0.0 with an FP compare of the source against 0.0. Use the compare result as a mask to zero out any NaNs resulting from 0 * +Infinity in sqrt(x) = x * rsqrt(x). Clang does this when autovectorizing. (But it uses blendps with the zeroed vector, instead of using the compare mask with andnps directly to zero or preserve elements.)
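A minimal intrinsics sketch of that masking idea (not the answer's exact sequence: it uses a NEQ compare plus AND, which is equivalent to the compare-against-zero plus andnps it describes, and it omits the Newton-Raphson refinement discussed below):

#include <immintrin.h>

/* sqrt(x) ~= x * rsqrt(x), with the NaN produced at x == 0 masked back to 0.0.
   Lanes where x != 0 compare to all-ones, so the AND keeps them unchanged. */
static inline __m256 sqrt_via_rsqrt_zero_safe(__m256 x)
{
    __m256 approx  = _mm256_mul_ps(x, _mm256_rsqrt_ps(x));              /* NaN where x == 0 */
    __m256 nonzero = _mm256_cmp_ps(x, _mm256_setzero_ps(), _CMP_NEQ_OQ);
    return _mm256_and_ps(approx, nonzero);                              /* 0.0 where x == 0 */
}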
It would also be possible to use sqrt(x) ~= recip(rsqrt(x)), as suggested by njuffa. rsqrt(0) = +Inf. recip(+Inf) = 0. However, using two approximations would compound the relative error, which is a problem.
The thing you're missing:
Truncating to integer (instead of rounding) requires an accurate sqrt result when the input is a perfect square. If the result for 25*rsqrt(25) is 4.999999 or something (instead of 5.00001), you'll add 4 instead of 5.
Even with a Newton-Raphson iteration, rsqrtps isn't perfectly accurate the way sqrtps is, so it might still give 5.0 - 1ulp. (1ulp = one unit in the last place = lowest bit of the mantissa).
Also:
Newton Raphson formula explained
Newton Raphson SSE implementation performance (latency/throughput). Note that we care more about throughput than latency, since we're using it in a loop that doesn't do much else. sqrt isn't part of the loop-carried dep chain, so different iterations can have their sqrt calcs in flight at once.
It might be possible to kill 2 birds with one stone by adding a small constant before doing the (x+offset)*approx_rsqrt(x+offset) and then truncating to integer. Large enough to overcome the max relative error of 1.5*2^-12, but small enough not to bump sqrt_approx(63*63-1+offset) up to 63 (the most sensitive case).
63*1.5*2^(-12) == 0.023071...
approx_sqrt(63*63-1) == 62.99206... +/- 0.023068..
Actually, we're screwed without a Newton iteration even without adding anything. approx_sqrt(63*63-1) could come out above 63.0 all by itself. n=36 is the largest value where the relative error in sqrt(n*n-1) + error is less than sqrt(n*n). GNU Calc:
define f(n) { local x=sqrt(n*n-1); local e=x*1.5*2^(-12); print x; print e, x+e; }
; f(36)
35.98610843089316319413
~0.01317850650545403926 ~35.99928693739861723339
; f(37)
36.9864840178138587015
~0.01354485498699237990 ~37.00002887280085108140
Does your source data have any properties that mean you don't have to worry about it being just below a large perfect square? e.g. is it always perfect squares?
You could check all possible input values, since the important domain is very small (integer FP values from 0..63*63), to see if the error in practice is small enough on Intel Haswell, but that would be a brittle optimization that could make your code break on AMD CPUs, or even on future Intel CPUs. Unfortunately, just coding to the ISA spec's guarantee that the relative error is up to 1.5*2^-12 requires more instructions. I don't see any tricks that would avoid a NR iteration.
If your upper limit was smaller (like 20), you could just do isqrt = static_cast<int> ((x+0.5)*approx_rsqrt(x+0.5)). You'd get 20 for 20*20, but always 19 for 20*20-1.
; define test_approx_sqrt(x, off) { local s=x*x+off; local sq=s/sqrt(s); local sq_1=(s-1)/sqrt(s-1); local e=1.5*2^(-12); print sq, sq_1; print sq*e, sq_1*e; }
; test_approx_sqrt(20, 0.5)
~20.01249609618950056874 ~19.98749609130668473087 # (x+0.5)/sqrt(x+0.5)
~0.00732879495710064718 ~0.00731963968187500662 # relative error
Note that val * (x +/- err) = val*x +/- val*err. IEEE FP mul produces results that are correctly rounded to 0.5ulp, so this should work for FP relative errors.
Anyway, I think you need one Newton-Raphson iteration.
The best bet is to add 0.5 to your input before doing an approx_sqrt using rsqrt. That sidesteps the 0/0 = NaN problem, and pushes the +/- error range all to one side of the whole number cut point (for numbers in the range we care about).
FP min/max instructions have the same performance as FP add, and will be on the critical path either way. Using an add instead of a max also solves the problem of results for perfect squares potentially being a few ulp below the correct result.
Compiler output: a decent starting point
I get pretty good autovectorization results from clang 3.7.1 with sum_int, with -fno-math-errno -funsafe-math-optimizations. -ffinite-math-only is not required (but even with the full -ffast-math, clang avoids sqrt(0) = NaN when using rsqrtps).
sum_fp doesn't auto-vectorize, even with the full -ffast-math.
However clang's version suffers from the same problem as your idea: truncating an inexact result from rsqrt + NR, potentially giving the wrong integer. IDK if this is why gcc doesn't auto-vectorize, because it could have used sqrtps for a big speedup without changing the results. (At least, as long as all the floats are between 0 and INT_MAX; otherwise converting back to integer will give the "indefinite" result of INT_MIN - sign bit set, all other bits cleared.) This is a case where -ffast-math breaks your program, unless you use -mrecip=none or something.
See the asm output on godbolt from:
// autovectorizes with clang, but has rounding problems.
// Note the use of sqrtf, and that floorf before truncating to int is redundant. (removed because clang doesn't optimize away the roundps)
#include <cmath>
#include <algorithm>
#define MAX 4096

int sum_int(float vals[]){
    int sum = 0;
    for (int i = 0; i < MAX; i++) {
        int thisSqrt = (int) sqrtf(vals[i]);
        sum += std::min(thisSqrt, 0x3F);
    }
    return sum;
}
To manually vectorize with intrinsics, we can look at the asm output from -fno-unroll-loops (to keep things simple). I was going to include this in the answer, but then realized that it had problems.
putting it together:
I think converting to int inside the loop is better than using floorf and then addps. roundps is a 2-uop instruction (6c latency) on Haswell (1uop in SnB/IvB). Worse, both uops require port1, so they compete with FP add / mul. cvttps2dq is a 1-uop instruction for port1, with 3c latency, and then we can use integer min and add to clamp and accumulate, so port5 gets something to do. Using an integer vector accumulator also means the loop-carried dependency chain is 1 cycle, so we don't need to unroll or use multiple accumulators to keep multiple iterations in flight. Smaller code is always better for the big picture (uop cache, L1 I-cache, branch predictors).
As long as we aren't in danger of overflowing 32bit accumulators, this seems to be the best choice. (Without having benchmarked anything or even tested it).
I'm not using the sqrt(x) ~= approx_recip(approx_sqrt(x)) method, because I don't know how to do a Newton iteration to refine it (probably it would involve a division). And because the compounded error is larger.
Horizontal sum from this answer.
Complete but untested version:
#include <immintrin.h>
#define MAX 4096
// 2*sqrt(x) ~= 2*x*approx_rsqrt(x), with a Newton-Raphson iteration
// dividing by 2 is faster in the integer domain, so we don't do it
__m256 approx_2sqrt_ps256(__m256 x) {
// clang / gcc usually use -3.0 and -0.5. We could do the same by using fnmsub_ps (add 3 = subtract -3), so we can share constants
__m256 three = _mm256_set1_ps(3.0f);
//__m256 half = _mm256_set1_ps(0.5f); // we omit the *0.5 step
__m256 nr = _mm256_rsqrt_ps( x ); // initial approximation for Newton-Raphson
// 1/sqrt(x) ~= nr * (3 - x*nr * nr) * 0.5 = nr*(1.5 - x*0.5*nr*nr)
// sqrt(x) = x/sqrt(x) ~= (x*nr) * (3 - x*nr * nr) * 0.5
// 2*sqrt(x) ~= (x*nr) * (3 - x*nr * nr)
__m256 xnr = _mm256_mul_ps( x, nr );
__m256 three_minus_muls = _mm256_fnmadd_ps( xnr, nr, three ); // -(xnr*nr) + 3
return _mm256_mul_ps( xnr, three_minus_muls );
}
// packed int32_t: correct results for inputs from 0 to well above 63*63
__m256i isqrt256_ps(__m256 x) {
__m256 offset = _mm256_set1_ps(0.5f); // or subtract -0.5, to maybe share constants with compiler-generated Newton iterations.
__m256 xoff = _mm256_add_ps(x, offset); // avoids 0*Inf = NaN, and rounding error before truncation
__m256 approx_2sqrt_xoff = approx_2sqrt_ps256(xoff);
__m256i i2sqrtx = _mm256_cvttps_epi32(approx_2sqrt_xoff);
return _mm256_srli_epi32(i2sqrtx, 1); // divide by 2 with truncation
// alternatively, we could mask the low bit to zero and divide by two outside the loop, but that has no advantage unless port0 turns out to be the bottleneck
}
__m256i isqrt256_ps_simple_exact(__m256 x) {
__m256 sqrt_x = _mm256_sqrt_ps(x);
__m256i isqrtx = _mm256_cvttps_epi32(sqrt_x);
return isqrtx;
}
int hsum_epi32_avx(__m256i x256){
__m128i xhi = _mm256_extracti128_si256(x256, 1);
__m128i xlo = _mm256_castsi256_si128(x256);
__m128i x = _mm_add_epi32(xlo, xhi);
__m128i hl = _mm_shuffle_epi32(x, _MM_SHUFFLE(1, 0, 3, 2));
hl = _mm_add_epi32(hl, x);
x = _mm_shuffle_epi32(hl, _MM_SHUFFLE(2, 3, 0, 1));
hl = _mm_add_epi32(hl, x);
return _mm_cvtsi128_si32(hl);
}
int sum_int_avx(float vals[]){
__m256i sum = _mm256_setzero_si256();
__m256i upperlimit = _mm256_set1_epi32(0x3F);
for (int i = 0; i < MAX; i+=8) {
__m256 v = _mm256_loadu_ps(vals+i);
__m256i visqrt = isqrt256_ps(v);
// assert visqrt == isqrt256_ps_simple_exact(v) or something
visqrt = _mm256_min_epi32(visqrt, upperlimit);
sum = _mm256_add_epi32(sum, visqrt);
}
return hsum_epi32_avx(sum);
}
Compiles on godbolt to nice code, but I haven't tested it. clang makes slightly nicer code than gcc: clang uses broadcast-loads from 4B locations for the set1 constants, instead of repeating them at compile time into 32B constants. gcc also has a bizarre movdqa to copy a register.
Anyway, the whole loop winds up being only 9 vector instructions, compared to 12 for the compiler-generated sum_int version. It probably didn't notice the x*initial_guess(x) common-subexpressions that occur in the Newton-Raphson iteration formula when you're multiplying the result by x, or something like that. It also does an extra mulps instead of a psrld because it does the *0.5 before converting to int. So that's where the extra two mulps instructions come from, and there's the cmpps/blendvps.
sum_int_avx(float*):
vpxor ymm3, ymm3, ymm3
xor eax, eax
vbroadcastss ymm0, dword ptr [rip + .LCPI4_0] ; set1(0.5)
vbroadcastss ymm1, dword ptr [rip + .LCPI4_1] ; set1(3.0)
vpbroadcastd ymm2, dword ptr [rip + .LCPI4_2] ; set1(63)
LBB4_1: ; latencies
vaddps ymm4, ymm0, ymmword ptr [rdi + 4*rax] ; 3c
vrsqrtps ymm5, ymm4 ; 7c
vmulps ymm4, ymm4, ymm5 ; x*nr ; 5c
vfnmadd213ps ymm5, ymm4, ymm1 ; 5c
vmulps ymm4, ymm4, ymm5 ; 5c
vcvttps2dq ymm4, ymm4 ; 3c
vpsrld ymm4, ymm4, 1 ; 1c this would be a mulps (but not on the critical path) if we did this in the FP domain
vpminsd ymm4, ymm4, ymm2 ; 1c
vpaddd ymm3, ymm4, ymm3 ; 1c
; ... (those 9 insns repeated: loop unrolling)
add rax, 16
cmp rax, 4096
jl .LBB4_1
;... horizontal sum
IACA thinks that with no unroll, Haswell can sustain a throughput of one iteration per 4.15 cycles, bottlenecking on ports 0 and 1. So potentially you could shave a cycle by accumulating sqrt(x)*2 (with truncation to even numbers using _mm256_and_si256), and only divide by two outside the loop.
Also according to IACA, the latency of a single iteration is 38 cycles on Haswell. I only get 31c, so probably it's including L1 load-use latency or something. Anyway, this means that to saturate the execution units, operations from 8 iterations have to be in flight at once. That's 8 * ~14 unfused-domain uops = 112 unfused-uops (or less with clang's unroll) that have to be in flight at once. Haswell's scheduler is actually only 60 entries, but the ROB is 192 entries. The early uops from early iterations will already have executed, so they only need to be tracked in the ROB, not also in the scheduler. Many of the slow uops are at the beginning of each iteration, though. Still, there's reason to hope that this will come close-ish to saturating ports 0 and 1. Unless data is hot in L1 cache, cache/memory bandwidth will probably be the bottleneck.
Interleaving operations from multiple dep chains would also be better. When clang unrolls, it puts all 9 instructions for one iteration ahead of all 9 instructions for another iteration. It uses a surprisingly small number of registers, so it would be possible to have instructions for 2 or 4 iterations mixed together. This is the sort of thing compilers are supposed to be good at, but which is cumbersome for humans. :/
It would also be slightly more efficient if the compiler chose a one-register addressing mode, so the load could micro-fuse with the vaddps. gcc does this.

Trick to divide a constant (power of two) by an integer

NOTE This is a theoretical question. I'm happy with the performance of my actual code as it is. I'm just curious about whether there is an alternative.
Is there a trick to do an integer division of a constant value, which is itself an integer power of two, by an integer variable value, without having to do an actual divide operation?
// The fixed value of the numerator
#define SIGNAL_PULSE_COUNT 0x4000UL
// The division that could use a neat trick.
uint32_t signalToReferenceRatio(uint32_t referenceCount)
{
    // Promote the numerator to a 64 bit value, shift it left by 32 so
    // the result has an adequate number of bits of precision, and divide
    // by the reference count (the denominator).
    return (uint32_t)((((uint64_t)SIGNAL_PULSE_COUNT) << 32) / referenceCount);
}
I've found several (lots) of references for tricks to do division by a constant, both integer and floating point. For example, the question What's the fastest way to divide an integer by 3? has a number of good answers including references to other academic and community materials.
Given that the numerator is constant, and it's an integer power of two, is there a neat trick that could be used in place of doing an actual 64 bit division; some kind of bit-wise operation (shifts, AND, XOR, that kind of stuff) or similar?
I don't want any loss of precision (beyond a possible half bit due to integer rounding) greater than that of doing the actual division, as the precision of the instrument relies on the precision of this measurement.
"Let the compiler decide" is not an answer, because I want to know if there is a trick.
Extra, Contextual Information
I'm developing a driver on a 16 bit data, 24 bit instruction word micro-controller. The driver does some magic with the peripheral modules to obtain a pulse count of a reference frequency for a fixed number of pulses of a signal frequency. The required result is a ratio of the signal pulses to the reference pulse, expressed as an unsigned 32 bit value. The arithmetic for the function is defined by the manufacturer of the device for which I'm developing the driver, and the result is processed further to obtain a floating point real-world value, but that's outside the scope of this question.
The micro-controller I'm using has a Digital Signal Processor that has a number of division operations that I could use, and I'm not afraid to do so if necessary. There would be some minor challenges to overcome with this approach, beyond putting together the assembly instructions to make it work, such as the DSP being used to do a PID function in a BLDC driver ISR, but nothing I can't manage.
You cannot use clever mathematical tricks to not do a division, but you can of course still use programming tricks if you know the range of your reference count:
Nothing beats a pre-computed lookup table in terms of speed (a sketch of this follows below).
There are fast approximate square root algorithms (probably already in your DSP), and you can improve the approximation by one or two Newton-Raphson iterations. If doing the computation with floating-point numbers is accurate enough for you, you can probably beat a 64bit integer division in terms of speed (but not in clarity of code).
Since you mentioned that the result will be converted to floating point later, it might be beneficial to not compute the integer division at all, but to use your floating-point hardware.
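To make the lookup-table suggestion concrete, here is a minimal sketch. It assumes referenceCount is known to sit in a narrow window; REF_MIN and REF_MAX are invented bounds for illustration (chosen above 2^14 so 2^46 / referenceCount still fits in 32 bits), and the table is filled once using the slow division:

#include <stdint.h>

#define SIGNAL_PULSE_COUNT 0x4000UL

/* Hypothetical bounds: the table only pays off when referenceCount is known
   to fall in a small range. 256 entries of 4 bytes = 1 KB of storage here. */
#define REF_MIN 20000u
#define REF_MAX 20255u

static uint32_t ratioTable[REF_MAX - REF_MIN + 1];

void initRatioTable(void)                        /* run once, e.g. at startup */
{
    for (uint32_t r = REF_MIN; r <= REF_MAX; r++)
        ratioTable[r - REF_MIN] =
            (uint32_t)((((uint64_t)SIGNAL_PULSE_COUNT) << 32) / r);
}

uint32_t signalToReferenceRatioLUT(uint32_t referenceCount)
{
    return ratioTable[referenceCount - REF_MIN]; /* caller guarantees the range */
}

The obvious trade-off is memory: the table grows linearly with the width of the allowed referenceCount range.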
I worked out a Matlab version, using fixed point arithmetic.
This method assumes that an integer version of log2(x) can be calculated efficiently, which is true for dsPIC30/33F and TI C6000, which have an instruction to detect the most significant 1 of an integer.
For this reason, this code has a strong ISA dependency and cannot be written in portable/standard C; it could be improved further using instructions like multiply-and-add and multiply-and-shift, so I won't try translating it to C.
nrdiv.m
function [ y ] = nrdiv( q, x, lut)
% assume q>31, lut = 2^31/[1,1,2,...255]
p2 = ceil(log2(x)); % available in TI C6000, instruction LMBD
% available in Microchip dsPIC30F/33F, instruction FF1L
if p2<8
pre_shift=0;
else
pre_shift=p2-8;
end % shr = (p2-8)>0?(p2-8):0;
xn = shr(x, pre_shift); % xn = x>>pre_shift;
y = shr(lut(xn), pre_shift); % y = lut[xn]>>pre_shift;
y = shr(y * (2^32 - y*x), 30); % basic iteration
% step up from q31 to q32
y = shr(y * (2^33 - y*x), (64-q)); % step up from q32 to desired q
if q>39
y = shr(y * (2^(1+q) - y*x), (q)); % when q>40, additional
% iteration is required,
end % no step up is performed
end
function y = shr(x, r)
y=floor(x./2^r); % simulate operator >>
end
test.m
test_number = (2^22-12345);
test_q = 48;
lut_q31 = round(2^31 ./ [1,[1:1:255]]);
display(sprintf('tested 2^%d/%d, diff=%f\n',test_q, test_number,...
nrdiv( 39, (2^22-5), lut_q31) - 2^39/(2^22-5)));
sample output
tested 2^48/4181959, diff=-0.156250
reference:
Newton–Raphson division
A little late but here is my solution.
First some assumptions:
Problem:
X = N/D, where N is a constant and a power of 2.
All 32 bit unsigned integers.
X is unknown but we have a good estimate (a previous, but no longer accurate, solution).
An exact solution is not required.
Note: due to integer truncation this is not an accurate algorithm!
An iterative solution is okay (improves with each loop).
Division is much more expensive than multiplication:
For a 32-bit unsigned integer on an Arduino UNO:
'+/-' ~0.75us
'*' ~3.5us
'/' ~36us
We seek to replace the division. Basically, let's start with Newton's method:
Xnew = Xold - f(Xold)/f'(Xold)
where f(X) = 0 for the solution we seek (here f(X) = N/X - D).
Solving this I get:
Xnew = Xold*(C - Xold*D)/N
where C = 2*N
First trick:
Now that the numerator (a constant) has become a divisor (a constant), one solution here (which does not require N to be a power of 2) is:
Xnew = Xold*(C - Xold*D)*A >> M
where C = 2*N, and A and M are constants (look up dividing-by-a-constant tricks).
Or (staying with Newton's method):
Xnew = Xold*(C - Xold*D) >> M
where C = 2<<M and M is the power (i.e. N = 1<<M).
So I have 2 '*' (7.0us), a '-' (0.75us) and a '>>' (0.75us?) or 8.5us total (rather than 36us), excluding other overheads.
Limitations:
As the data type is 32 bit unsigned, 'M' should not exceed 15 else there will be problems with overflow (you can probably get around this using a 64bit intermediate data type).
N>D (else the algorithm blows up! at least with unsigned integer)
(Obviously the algorithm will work with signed and float data types.)
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

int main(void)
{
    unsigned long c, d, m, x;

    // x = n/d where n = 1<<m
    m = 15;
    c = 2 << m;
    d = 10;
    x = 10;          // initial estimate
    while (true)
    {
        x = x*(c - d*x) >> m;
        printf("%lu\n", x);
        getchar();
    }
    return (0);
}
Having tried many alternatives, I ended up doing normal binary long division in assembly language. However, the routine does use a few optimisations that bring the execution time down to an acceptable level.
/*
* Converts the reference frequency count for a specific signal frequency
* to a ratio.
* Xs = Ns * 2^32 / Nr
* Where:
* 2^32 is a constant scaling so that the maximum accuracy can be achieved.
* Ns is the number of signal counts (fixed at 0x4000 by hardware).
* Nr is the number of reference counts, passed in W1:W0.
* @param W1:W0 The number of reference frequency pulses.
* @return W1:W0 The scaled ratio.
*/
.align 2
.global _signalToReferenceRatio
.type _signalToReferenceRatio, #function
; This is the position of the most significant bit of the fixed Ns (0x4000).
.equ LOG2_DIVIDEND, 14
.equ DIVISOR_LIMIT, LOG2_DIVIDEND+1
.equ WORD_SIZE, 16
_signalToReferenceRatio:
; Create a dividend, MSB-aligned with the divisor, in W2:W3 and place the
; number of iterations required for the MSW in [W14] and the LSW in [W14+2].
LNK #4
MUL.UU W2, #0, W2
FF1L W1, W4
; If MSW is zero the argument is out of range.
BRA C, .returnZero
SUBR W4, #WORD_SIZE, W4
; Find the number of quotient MSW loops.
; This is effectively 1 + log2(dividend) - log2(divisor).
SUBR W4, #DIVISOR_LIMIT, [W14]
BRA NC, .returnZero
; Since the SUBR above is always non-negative and the C flag set, use this
; to set bit W3<W5> and the dividend in W2:W3 = 2^(16+W5) = 2^log2(divisor).
BSW.C W3, W4
; Use 16 quotient LSW loops.
MOV #WORD_SIZE, W4
MOV W4, [W14+2]
; Set up W4:W5 to hold the divisor and W0:W1 to hold the result.
MOV.D W0, W4
MUL.UU W0, #0, W0
.checkLoopCount:
; While the bit count is non-negative ...
DEC [W14], [W14]
BRA NC, .nextWord
.alignQuotient:
; Shift the current quotient word up by one bit.
SL W0, W0
; Subtract divisor from the current dividend part.
SUB W2, W4, W6
SUBB W3, W5, W7
; Check if the dividend part was less than the divisor.
BRA NC, .didNotDivide
; It did divide, so set the LSB of the quotient.
BSET W0, #0
; Shift the remainder up by one bit, with the next zero in the LSB.
SL W7, W3
BTSC W6, #15
BSET W3, #0
SL W6, W2
BRA .checkLoopCount
.didNotDivide:
; Shift the next (zero) bit of the dividend into the LSB of the remainder.
SL W3, W3
BTSC W2, #15
BSET W3, #0
SL W2, W2
BRA .checkLoopCount
.nextWord:
; Test if there are any LSW bits left to calculate.
MOV [++W14], W6
SUB W6, #WORD_SIZE, [W14--]
BRA NC, .returnQ
; Decrement the remaining bit counter before writing it back.
DEC W6, [W14]
; Move the working part of the quotient up into the MSW of the result.
MOV W0, W1
BRA .alignQuotient
.returnQ:
; Return the quotient in W0:W1.
ULNK
RETURN
.returnZero:
MUL.UU W0, #0, W0
ULNK
RETURN
.size _signalToReferenceRatio, .-_signalToReferenceRatio

Addition in neon register

Suppose I have a 64-bit D register in NEON, and let's say it stores the value ABCDEFGH.
Now I want to add A&E, B&F, C&G, D&H and so on. Is there any intrinsic by which it is possible to do such an operation?
I looked at the documentation but didn't find something suitable.
If you want the addition to be carried out in 16 bits, i.e. produce a uint16x4 result, you can use vmovl to promote the input vector from uint8x8 to uint16x8, then use vadd to add the lower and higher halves. Expressed in NEON intrinsics, this is achieved by
const uint16x8_t t = vmovl_u8(input);
const uint16x4_t r = vadd_u16(vget_low_u16(t), vget_high_u16(t));
This should compile to the following assembly (d0 is the 64-bit input register, d1 is the 64-bit output register). Note that vget_low_u16 and vget_high_u16 don't produce any instructions - these intrinsics are implemented by suitable register allocation, exploiting that Q registers are just a convenient way to name two consecutive D registers. Q{n} refers to the pair (D{2n}, D{2n+1}).
VMOVL.U8 q1, d0
VADD.I16 d1, d2, d3
If you want the operation to be carried out in 8 bits, and saturate in case of an overflow, do
const uint8x8_t t = vreinterpret_u8_u64(vshr_n_u64(vreinterpret_u64_u8(input), 32));
const uint8x8_t r = vqadd_u8(input, t);
This compiles to (d0 is the input again, output in d1)
VSHR.U64 d1, d0, #32
VQADD.I8 d1, d0
By replacing VQADD with just VADD, the results will wrap-around on overflow instead of being saturated to 0xff.

How do I reorder vector data using ARM Neon intrinsics?

This is specifically related to ARM NEON SIMD coding. I am using ARM NEON intrinsics for a certain module in a video decoder. I have vectorized data as follows:
There are four 32 bit elements in a Neon register - say, Q0 - which is of size 128 bit.
3B 3A 1B 1A
There are another four 32 bit elements in another NEON register - say, Q1 - which is of size 128 bit.
3D 3C 1D 1C
I want the final data to be in order as shown below:
1D 1C 1B 1A
3D 3C 3B 3A
What NEON intrinsics can achieve the desired data order?
how about something like this:
int32x4_t q0, q1;
/* split into 64 bit vectors */
int32x2_t q0_hi = vget_high_s32 (q0);
int32x2_t q1_hi = vget_high_s32 (q1);
int32x2_t q0_lo = vget_low_s32 (q0);
int32x2_t q1_lo = vget_low_s32 (q1);
/* recombine into 128 bit vectors */
q0 = vcombine_s32 (q0_lo, q1_lo);
q1 = vcombine_s32 (q0_hi, q1_hi);
In theory this should compile to just two move instructions because the vget_high and vget_low just reinterpret the 128 bit Q registers as two 64 bit D registers. vcombine otoh just compiles to one or two moves (depends on register allocation).
Oh - and the order of the integers in the output could be exactly the wrong way around. If so just swap the arguments to vcombine_s32.
Remember each q register is made up of two d registers, for instance the low part of q0 is d0 and the high part d1. So in fact, this operation is just swapping d0 and d3 (or d1 and d2, it is not entirely clear from your data presentation). There is even a swap instruction to do it in one instruction!
Disclaimer: I don't know Neon intrinsics (I directly code in assembly), though I'd be surprised if this couldn't be done using intrinsics.
It looks like you should be able to use the VTRN instruction (e.g. vtrnq_u32) for this.
Pierre is right.
vswp d0, d3
that will do.
@Pierre:
I read the post about NEON on your blog several months ago. I was pleasantly surprised that there was someone like me - writing hand-optimized assembly code, both ARM and NEON.
Nice to see you.
