In one of our last assignments in Computer Science this term we have to apply strength reduction to some code fragments. Most of them were straightforward, especially when looking at the compiler output. But one of them I am not able to solve, even with the help of the compiler.
Our profs gave us the following hint:
Hint: Inquire how IEEE 754 single-precision floating-point numbers are
represented in memory.
Here is the code snippet: (a is of type double*)
for (int i = 0; i < N; ++i) {
a[i] += i / 5.3;
}
At first I tried to look at the compiler output for this snippet on godbolt. I compiled it without any optimization (note: I copied only the relevant part inside the for loop):
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
movsd xmm1, QWORD PTR [rax]
cvtsi2sd xmm0, DWORD PTR [rbp-4] //division relevant
movsd xmm2, QWORD PTR .LC0[rip] //division relevant
divsd xmm0, xmm2 //division relevant
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
and with -O3:
.L2:
pshufd xmm0, xmm2, 238 //division relevant
cvtdq2pd xmm1, xmm2 //division relevant
movupd xmm6, XMMWORD PTR [rax]
add rax, 32
cvtdq2pd xmm0, xmm0 //division relevant
divpd xmm1, xmm3 //division relevant
movupd xmm5, XMMWORD PTR [rax-16]
paddd xmm2, xmm4
divpd xmm0, xmm3 //division relevant
addpd xmm1, xmm6
movups XMMWORD PTR [rax-32], xmm1
addpd xmm0, xmm5
movups XMMWORD PTR [rax-16], xmm0
cmp rax, rbp
jne .L2
I marked the division-related part of the assembly code with comments. But this output does not help me understand how to apply strength reduction to the snippet. (Maybe there are too many optimizations going on to fully understand the output.)
Second, I tried to understand the bit representation of the constant 5.3 as a single-precision float, which is:
0 |10000001|01010011001100110011010
Sign|Exponent|Mantissa
But this does not help me either.
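For completeness, this is how the bit pattern can be printed (a small stand-alone check, assuming float is an IEEE-754 binary32 type):
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main (void)
{
    float x = 5.3f;
    uint32_t bits;
    memcpy (&bits, &x, sizeof bits); // reinterpret the float's bytes as an integer
    printf ("sign=%u exponent=%u mantissa=0x%06x\n",
            (unsigned)(bits >> 31),
            (unsigned)((bits >> 23) & 0xff),
            (unsigned)(bits & 0x7fffff));
    return 0;
}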
If we adopt Wikipedia's definition that
strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations
then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, that the default rounding mode for floating-point computation is round-to-nearest-or-even, and that int is a 32-bit type, we can prove the transformation correct by a simple exhaustive test:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>
int main (void)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
        if (res != ref) {
            printf ("error: i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}
Most modern instances of common processor architectures like x86-64 and ARM64 have hardware support for FMA, such that fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the generated binary. Where hardware support for FMA is lacking, the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.
The basic idea here is that mathematically, division is equivalent to multiplication with the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background including literature references see this earlier question.
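Spelled out, the expression in the code above decomposes into these steps (the same computation written out; the helper name div_by_5p3 is mine):
#include <math.h>

// q0 is the naive result i * (1/5.3). The first fma() computes the residual
// i - 5.3*q0, with the product 5.3*q0 formed exactly inside the FMA; the
// second fma() folds the scaled residual back into q0 as a correction.
static double div_by_5p3 (int i)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    double q0 = i * rcp_5p3;
    double r  = fma (-5.3, q0, (double)i);
    return fma (r, rcp_5p3, q0);
}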
To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.
In comments, Pascal Cuoq points out that there is an alternative algorithm to potentially strength-reduce floating-point division with a compile-time constant divisor, by precomputing the reciprocal of the divisor to more than native precision, specifically as a double-double. For background see N. Brisebarre and J.-M. Muller, "Correctly rounded multiplication by arbitrary precision constants", IEEE Transactions on Computers, 57(2): 162-174, February 2008, which also provides guidance on how to determine whether that transformation is safe for any particular constant. Since the present case is simple, I again used an exhaustive test to show it is safe. Where applicable, this reduces the division down to one FMA plus a multiply:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mathimf.h>
int main (void)
{
    const double rcp_5p3_hi =  1.8867924528301888e-1;  //  0x1.826a439f656f2p-3
    const double rcp_5p3_lo = -7.2921377017921457e-18; // -0x1.0d084b1883f6e0p-57
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
        if (res != ref) {
            printf ("i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}
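For reference, the low-order part of such a double-double reciprocal can be derived along the following lines (my own sketch, not taken from the paper; it relies on the FMA computing the residual 1 - divisor*hi accurately, which should be re-checked for each constant):
#include <math.h>

// hi is the reciprocal rounded to double; lo captures (most of) its rounding
// error, so that hi + lo represents 1/divisor to roughly twice native precision.
static void split_reciprocal (double divisor, double *hi, double *lo)
{
    *hi = 1.0 / divisor;
    *lo = fma (-divisor, *hi, 1.0) / divisor; // (1 - divisor*hi) / divisor
}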
To cover another aspect: since all values of type int are exactly representable as double (but not as float), it is possible to get rid of the int-to-double conversion that happens in the loop when evaluating i / 5.3 by introducing a floating-point variable that counts from 0.0 to N:
double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
a[i] += fp_i / 5.3;
However, this kills autovectorization and introduces a chain of dependent floating-point additions. Floating-point addition latency is typically 3 or 4 cycles, so the last iteration will retire after at least (N-1)*3 cycles, even if the CPU could dispatch the instructions in the loop faster. Thankfully, floating-point division is not fully pipelined, and the number of cycles between successive floating-point divisions that an x86 CPU can sustain roughly matches or exceeds the latency of the addition instruction.
This leaves the problem of killed vectorization. It's possible to bring it back by manually unrolling the loop and introducing two independent chains, but with AVX you'd need four chains for full vectorization:
double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
    double t0 = a[i], t1 = a[i+1];
    a[i]   = t0 + fp_i0 / 5.3;
    a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
    a[i] += i / 5.3;
CAVEAT: After a few days I realized that this answer is incorrect in that it ignores the consequence of underflow (to subnormal or to zero) in the computation of o / 5.3. In this case, multiplying the result by a power of two is “exact” but does not produce the result that dividing the larger integer by 5.3 would have.
i / 5.3 only needs to be computed for odd values of i.
For even values of i, you can simply multiply by 2.0 the value of (i/2)/5.3, which was already computed earlier in the loop.
The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.
One way to achieve this is to iterate on all odd numbers o less than N, and after computing o / 5.3 in order to handle index o, to also handle all indexes of the form o * 2**p.
if (N > 0) {
    a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
    for (int o = 1; o < N; o += 2) {
        double d = o / 5.3;
        int i = o;
        do {
            a[i] += d;
            i += i;
            d += d;
        } while (i < N);
    }
}
Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but still the above is technically a partial answer to the question as phrased.
I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.
As an additional note, the above transformation only replaces about one division out of two, and it does so by making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization can be applied to the starting point your professor suggested, and that is likely to be more beneficial than suppressing half the divisions. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.
I would expect SSE to be faster than not using SSE. Do I need to add some additional compiler flags? Could it be that I am not seeing a speedup because this is integer code and not floating point?
invocation/output
$ make sum2
clang -O3 -msse -msse2 -msse3 -msse4.1 sum2.c ; ./a.out 123
n: 123
SSE Time taken: 0 seconds 124 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
NOSSE Time taken: 0 seconds 115 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
compiler
$ clang --version
Apple LLVM version 9.0.0 (clang-900.0.37)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
sum2.c
#include <stdlib.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
#define CYCLE_COUNT 10000
// add vector and return resulting value on stack
__attribute__((noinline)) __m128i add_iv(__m128i *a, __m128i *b) {
return _mm_add_epi32(*a,*b);
}
// add int vectors via sse
__attribute__((noinline)) void add_iv_sse(__m128i *a, __m128i *b, __m128i *out, int N) {
for(int i=0; i<N/sizeof(int); i++) {
//out[i]= _mm_add_epi32(a[i], b[i]); // this also works
_mm_storeu_si128(&out[i], _mm_add_epi32(a[i], b[i]));
}
}
// add int vectors without sse
__attribute__((noinline)) void add_iv_nosse(int *a, int *b, int *out, int N) {
for(int i=0; i<N; i++) {
out[i] = a[i] + b[i];
}
}
__attribute__((noinline)) void p128_as_int(__m128i in) {
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("int: %i %i %i %i\n", v[0], v[1], v[2], v[3]);
}
// print first 4 and last 4 elements of int array
__attribute__((noinline)) void debug_print(int *h) {
printf("vector+vector:begin ");
p128_as_int(* (__m128i*) &h[0] );
printf("vector+vector:end ");
p128_as_int(* (__m128i*) &h[32764] );
}
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
printf("n: %d\n", n);
// sum: vector + vector, of equal length
int f[32768] __attribute__((aligned(16))) = {0,2,4};
int g[32768] __attribute__((aligned(16))) = {1,3,n};
int h[32768] __attribute__((aligned(16)));
f[32765] = 33; f[32766] = 34; f[32767] = 35;
g[32765] = 31; g[32766] = 32; g[32767] = 33;
// https://stackoverflow.com/questions/459691/best-timing-method-in-c
clock_t start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_sse((__m128i*)f, (__m128i*)g, (__m128i*)h, 32768);
}
int msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf(" SSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
// process intense function again
start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_nosse(f, g, h, 32768);
}
msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf("NOSSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
return EXIT_SUCCESS;
}
Look at the asm: clang -O2 or -O3 probably auto-vectorizes add_iv_nosse (with a check for overlap, since you didn't use int * restrict a and so on).
Use -fno-tree-vectorize to disable auto vectorization, without stopping you from using intrinsics. I'd recommend clang -march=native -mno-avx -O3 -fno-tree-vectorize to test what I think you want to test, scalar integer vs. legacy-SSE paddd. (It works in gcc and clang. In clang, AFAIK it's a synonym for the clang-specific -fno-vectorize.)
BTW, timing both in the same executable hurts the first one, because the CPU doesn't ramp up to full turbo right away. You're probably into the timed section of the code before your CPU hits full speed. (So run this a couple of times back-to-back, with for i in {1..10}; do time ./a.out; done.)
On Linux I'd use perf stat -r5 ./a.out to run it 5 times with performance counters (and I'd split it up so one run tested one or the other, so I could look at perf counters for the whole run.)
Code review:
You forgot stdint.h for uint32_t. I had to add that to get it to compile on Godbolt to see the asm. (Assuming clang-5.0 is something like the Apple clang version you're using. IDK if Apple's clang implies a default -mtune= option, but that would make sense because it's only targeting Mac. Also a baseline SSSE3 would make sense for 64-bit on x86-64 OS X.)
You don't need noinline on debug_print. Also, I'd recommend a different name for CYCLE_COUNT. Cycles in this context makes me think of clock cycles, so call it REP_COUNT or REPEATS or whatever.
Putting your arrays on the stack in main is probably fine. You do initialize both input arrays (to mostly zero, but add performance isn't data-dependent).
This is good, because leaving them uninitialized might mean that multiple 4k pages of each array were copy-on-write mapped to the same physical zero page, so you'd get more than the expected number of L1D cache hits.
The SSE2 loop should bottleneck on L2 / L3 cache bandwidth, since the working set is 3 arrays * 32768 ints * 4 B = 384 kiB, about 1.5x the 256 kiB L2 cache in Intel CPUs.
clang might unroll its auto-vectorized loop more than it does your manual intrinsics loop. That might explain the better performance, since 16B vectors (rather than 32B AVX2 vectors) might not saturate cache bandwidth if you're not getting 2 loads + 1 store per clock.
Update: actually the loop overhead is pretty extreme, with 3 pointer increments + a loop counter, and only unrolling by 2 to amortize that.
The auto-vectorized loop:
.LBB2_12: # =>This Inner Loop Header: Depth=1
movdqu xmm0, xmmword ptr [r9 - 16]
movdqu xmm1, xmmword ptr [r9] # hoisted load for 2nd unrolled iter
movdqu xmm2, xmmword ptr [r10 - 16]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [r10]
paddd xmm0, xmm1
movdqu xmmword ptr [r11 - 16], xmm2
movdqu xmmword ptr [r11], xmm0
add r9, 32
add r10, 32
add r11, 32
add rbx, -8 # add / jne macro-fused on SnB-family CPUs
jne .LBB2_12
So it's 12 fused-domain uops, and can run at best 2 vectors per 3 clocks, bottlenecked on the front-end issue bandwidth of 4 uops per clock.
It's not using aligned loads because the compiler doesn't have that info without inlining into main where the alignment is known, and you didn't guarantee alignment with p = __builtin_assume_aligned(p, 16) or anything in the stand-alone function. Aligned loads (or AVX) would let paddd use a memory operand instead of a separate movdqu load.
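For example, a sketch (untested) of how the stand-alone function could promise both 16-byte alignment and non-overlap, so the auto-vectorizer can use aligned memory operands without a runtime check (the function name is mine):
__attribute__((noinline)) void add_iv_nosse_aligned(int *restrict a, int *restrict b, int *restrict out, int N) {
    // tell the compiler the pointers are 16-byte aligned; restrict rules out overlap
    a   = __builtin_assume_aligned(a, 16);
    b   = __builtin_assume_aligned(b, 16);
    out = __builtin_assume_aligned(out, 16);
    for(int i=0; i<N; i++) {
        out[i] = a[i] + b[i];
    }
}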
The manually-vectorized loop uses aligned loads to save front-end uops, but has more loop overhead from the loop counter.
.LBB1_7: # =>This Inner Loop Header: Depth=1
movdqa xmm0, xmmword ptr [rcx - 16]
paddd xmm0, xmmword ptr [rax - 16]
movdqu xmmword ptr [r11 - 16], xmm0
movdqa xmm0, xmmword ptr [rcx]
paddd xmm0, xmmword ptr [rax]
movdqu xmmword ptr [r11], xmm0
add r10, 2 # separate loop counter
add r11, 32 # 3 pointer increments
add rax, 32
add rcx, 32
cmp r9, r10 # compare the loop counter
jne .LBB1_7
So it's 11 fused-domain uops. It should be running faster than the auto-vectorized loop. Your timing method probably caused the problem.
(Unless mixing loads and stores is actually making it less optimal. The auto-vectorized loop did 4 loads and then 2 stores. Actually that might explain it. Your arrays are a multiple of 4kiB, and might all have the same relative alignment. So you might be getting 4k aliasing here, which means the CPU isn't sure that a store doesn't overlap a load. I think there's a performance counter you can check for that.)
See also Agner Fog's microarch guide (and instruction tables + optimization guide), and other links in the x86 tag wiki, especially Intel's optimization guide.
There's also some good SSE/SIMD beginner stuff in the sse tag wiki.
I have implemented a program using SSE2 in order to compare the AVX2 vpsadbw instruction with the SSE2 psadbw instruction. The following code is the SSE2 program:
#include <stdio.h>
#include <time.h>
#include <emmintrin.h>

#define MAX1 4096
#define MAX2 MAX1
#define MAX3 MAX1
#define NUM_LOOP 1000000000
double pTime = 0, mTime = 5;
//global data for sequential matrix operations
unsigned char a_char[MAX1][MAX2] __attribute__(( aligned(16)));
unsigned char b_char[MAX2][MAX3] __attribute__(( aligned(16)));
unsigned char c_char[MAX1][MAX3] __attribute__(( aligned(16)));
unsigned short int temp[8];
int main()
{
int i, j, w=0, sad=0;
struct timespec tStart, tEnd;
double tTotal , tBest=10000;
__m128i vec1, vec2, vecT, sad_total;
sad_total= _mm_setzero_si128();
do{
clock_gettime(CLOCK_MONOTONIC,&tStart);
for(i=0; i<MAX1; i++){
for(j=0; j<MAX2; j+=16){
vec1 = _mm_load_si128((__m128i *)&a_char[i][j]);
vec2 = _mm_load_si128((__m128i *)&b_char[i][j]);
vecT = _mm_sad_epu8( vec1 , vec2);
sad_total = _mm_add_epi64(vecT, sad_total);
}
}
_mm_store_si128((__m128i *)&temp[0], sad_total);
sad=temp[0]+temp[2]+temp[4]+temp[6];
clock_gettime(CLOCK_MONOTONIC,&tEnd);
tTotal = (tEnd.tv_sec - tStart.tv_sec);
tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
if(tTotal<tBest)
tBest=tTotal;
pTime += tTotal;
} while(w++ < NUM_LOOP && pTime < mTime);
printf(" The best time: %lf sec in %d repetition for %dX result is %d matrix\n",tBest,w, MAX1, sad);
return 0;
}
I use gcc on Skylake, running Linux Mint.
When I generate the assembly code, the inner loop contains some unwanted move operations for SSE2, as follows:
.L26:
vmovdqa xmm1, XMMWORD PTR a_char[rcx+rax]
vpsadbw xmm1, xmm1, XMMWORD PTR b_char[rcx+rax]
add rax, 16
vpaddq xmm3, xmm1, XMMWORD PTR [rsp]
cmp rax, 4096
vmovaps XMMWORD PTR [rsp], xmm3
jne .L26
The AVX2 version, by contrast, generates this assembly code:
.L26:
vmovdqa ymm1, YMMWORD PTR a_char[rcx+rax]
vpsadbw ymm1, ymm1, YMMWORD PTR b_char[rcx+rax]
add rax, 32
vpaddq ymm2, ymm2, ymm1
cmp rax, 4096
jne .L26
I don't know the reason for those two move instructions, which hurt the performance significantly.
The reason is this:
_mm_store_si128((__m128i *)&temp[0], sad_total);
Clang doesn't mind and makes nice code regardless, but GCC didn't like it (failed heuristics perhaps?)
With that replaced by something that doesn't trigger the "this should be on the stack all the time" heuristic, GCC makes nicer code, for example (not tested):
__m128i sad_total = _mm_setzero_si128();
for(i = 0; i < MAX1; i++) {
    for(j = 0; j < MAX2; j += 16) {
        __m128i vec1 = _mm_load_si128((__m128i *)&a_char[i][j]);
        __m128i vec2 = _mm_load_si128((__m128i *)&b_char[i][j]);
        __m128i vecT = _mm_sad_epu8(vec1, vec2);
        sad_total = _mm_add_epi64(sad_total, vecT);
    }
}
__m128i hsum = _mm_add_epi64(sad_total, _mm_bsrli_si128(sad_total, 8));
sad = _mm_cvtsi128_si32(hsum);
The inner loop now looks like
.L2:
vmovdqa xmm1, XMMWORD PTR a_char[rdx+rax]
vpsadbw xmm1, xmm1, XMMWORD PTR b_char[rdx+rax]
add rax, 16
vpaddq xmm2, xmm1, xmm2
cmp rax, 4096
jne .L2
You're directly bypassing the compiler and telling it to use movdqa via _mm_load_si128; it's doing exactly what you're telling it to do, so what is the problem here? I also noticed that you're aligning on a 16-byte boundary. Feel free to correct me if I'm wrong (I'm not sure how the attribute is implemented on your compiler), but you may get padding as a result, so that each element is aligned on a 16-byte boundary; if so, this will affect the impact of your unrolling.
int eq3(int a, int b, int c, int d, int e, int f){
return a == d || a == e || a == f
|| b == d || b == e || b == f
|| c == d || c == e || c == f;
}
This function receives 6 ints and returns true if any of the first 3 ints is equal to any of the last 3 ints. Is there any bitwise-hack-style way to make it faster?
Assuming you're expecting a high rate of false results you could make a quick "pre-check" to quickly isolate such cases:
If a bit in a is set that isn't set in any of d, e and f then a cannot be equal to any of these.
Thus something like
int pre_eq3(int a, int b, int c, int d, int e, int f){
    int const mask = ~(d | e | f);
    if ((a & mask) && (b & mask) && (c & mask)) {
        return false;
    }
    return eq3(a, b, c, d, e, f);
}
could speed it up (8 operations instead of 17, but much more costly if the result will actually be true). If mask == 0 then of course this won't help.
This can be further improved if, with high probability, a & b & c has some bits set:
int pre_eq3(int a, int b, int c, int d, int e, int f){
    int const mask = ~(d | e | f);
    if ((a & b & c) & mask) {
        return false;
    }
    if ((a & mask) && (b & mask) && (c & mask)) {
        return false;
    }
    return eq3(a, b, c, d, e, f);
}
Now if all of a, b and c have bits set where none of d, e and f have any bits set, we're out pretty fast.
Expanding on dawg's SSE comparison method, you can combine the results of the comparisons using a vector OR, and move a mask of the compare results back to an integer to test for 0 / non-zero.
Also, you can get data into vectors more efficiently (although it's still pretty clunky to get many separate integers into vectors when they're live in registers to start with, rather than sitting in memory).
You should avoid store-forwarding stalls that result from doing three small stores and one big load.
///// UNTESTED ////////
#include <immintrin.h>
int eq3(int a, int b, int c, int d, int e, int f){
    // Use _mm_set to let the compiler worry about getting integers into vectors
    // Use -mtune=intel or gcc will make bad code, though :(
    __m128i abcc = _mm_set_epi32(0,c,b,a); // args go from high to low position in the vector
    // masking off the high bits of the result-mask to avoid false positives
    // is cheaper than repeating c (to do the same compare twice)

    __m128i dddd = _mm_set1_epi32(d);
    __m128i eeee = _mm_set1_epi32(e);
    dddd = _mm_cmpeq_epi32(dddd, abcc);
    eeee = _mm_cmpeq_epi32(eeee, abcc); // per element: 0(unequal) or -1(equal)
    __m128i combined = _mm_or_si128(dddd, eeee);

    __m128i ffff = _mm_set1_epi32(f);
    ffff = _mm_cmpeq_epi32(ffff, abcc);
    combined = _mm_or_si128(combined, ffff);

    // results of all the compares are ORed together.  All zero only if there were no hits
    unsigned equal_mask = _mm_movemask_epi8(combined);
    equal_mask &= 0x0fff; // the high 32b element could have false positives
    return equal_mask;
    // return !!equal_mask if you want to force it to 0 or 1
    // the mask tells you whether it was a, b, or c that had a hit

    // movmskps would return a mask of just 4 bits, one for each 32b element, but might have a bypass delay on Nehalem.
    // actually, pmovmskb apparently runs in the float domain on Nehalem anyway, according to Agner Fog's table >.<
}
This compiles to pretty nice asm, pretty similar between clang and gcc, but clang's -fverbose-asm puts nice comments on the shuffles. Only 19 instructions including the ret, with a decent amount of parallelism from separate dependency chains. With -msse4.1, or -mavx, it saves another couple of instructions. (But probably doesn't run any faster)
With clang, dawg's version is about twice the size. With gcc, something bad happens and it's horrible (over 80 instructions. Looks like a gcc optimization bug, since it looks worse than just a straightforward translation of the source). Even clang's version spends so long getting data into / out of vector regs that it might be faster to just do the comparisons branchlessly and OR the truth values together.
This compiles to decent code:
// 8bit variable doesn't help gcc avoid partial-register stalls even with -mtune=core2 :/
int eq3_scalar(int a, int b, int c, int d, int e, int f){
    char retval = (a == d) | (a == e) | (a == f)
                | (b == d) | (b == e) | (b == f)
                | (c == d) | (c == e) | (c == f);
    return retval;
}
Play around with how to get the data from the caller into vector regs.
If the groups of three are coming from memory, then prob. passing pointers so a vector load can get them from their original location is best. Going through integer registers on the way to vectors sucks (higher latency, more uops), but if your data is already live in regs it's a loss to do integer stores and then vector loads. gcc is dumb and follows the AMD optimization guide's recommendation to bounce through memory, even though Agner Fog says he's found that's not worth it even on AMD CPUs. It's definitely worse on Intel, and apparently a wash or maybe still worse on AMD, so it's definitely the wrong choice for -mtune=generic. Anyway...
It's also possible to do 8 of our 9 compares with just two packed-vector compares.
The 9th can be done with an integer compare, and have its truth value ORed with the vector result. On some CPUs (esp. AMD, and maybe Intel Haswell and later) not transferring one of the 6 integers to vector regs at all might be a win. Mixing three integer branchless-compares in with the vector shuffles / compares would interleave them nicely.
These vector comparisons can be set up by using shufps on integer data (since it can combine data from two source registers). That's fine on most CPUs, but requires a lot of annoying casting when using intrinsics instead of actual asm. Even if there is a bypass delay, it's not a bad tradeoff vs. something like punpckldq and then pshufd.
aabb    ccab
====    ====
dede    deff
c==f
with asm something like:
#### untested
# pretend a is in eax, and so on
movd xmm0, eax
movd xmm1, ebx
movd xmm2, ecx
shl rdx, 32
#mov edi, edi # zero the upper 32 of rdi if needed, or use shld instead of OR if you don't care about AMD CPUs
or rdx, rdi # de in an integer register.
movq xmm3, rdx # de, aka (d<<32)|e
# in 32bit code, use a vector shuffle of some sort to do this in a vector reg, or:
#pinsrd xmm3, edi, 1 # SSE4.1, and 2 uops (same as movd+shuffle)
#movd xmm4, edi # e
movd xmm5, esi # f
shufps xmm0, xmm1, 0 # xmm0=aabb (low dword = a; my notation is backwards from left/right vector-shift perspective)
shufps xmm5, xmm3, 0b01000000 # xmm5 = ffde
punpcklqdq xmm3, xmm3 # broadcast: xmm3=dede
pcmpeqd xmm3, xmm0 # xmm3: aabb == dede
# spread these instructions out between vector instructions, if you aren't branching
xor edx,edx
cmp esi, ecx # c == f
#je .found_match # if there's one of the 9 that's true more often, make it this one. Branch mispredicts suck, though
sete dl
shufps xmm0, xmm2, 0b00001000 # xmm0 = abcc
pcmpeqd xmm0, xmm5 # abcc == ffde
por xmm0, xmm3
pmovmskb eax, xmm0 # will have bits set if cmpeq found any equal elements
or eax, edx # combine vector and scalar compares
jnz .found_match
# or record the result instead of branching on it
setnz dl
This is also 19 instructions (not counting the final jcc / setcc), but one of them is an xor-zeroing idiom, and there are other simple integer instructions. (Shorter encoding, some can run on port6 on Haswell+ which can't handle vector instructions). There might be a longer dep chain due to the chain of shuffles that builds abcc.
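For reference, one of those shufps steps written with intrinsics looks something like this (an untested sketch; va and vb are hypothetical __m128i variables already holding a and b, e.g. from _mm_cvtsi32_si128, and the casts are exactly the annoying part mentioned above):
#include <emmintrin.h>

// aabb = [a, a, b, b]: shufps takes its low two elements from the first source
// and its high two from the second; the casts just reinterpret the registers.
static inline __m128i shuffle_aabb(__m128i va, __m128i vb) {
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(va), _mm_castsi128_ps(vb),
                       _MM_SHUFFLE(0, 0, 0, 0)));
}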
If you want a bitwise version, look to xor. If you xor two numbers that are the same, the result will be 0. Otherwise, a bit will be set wherever exactly one of the two numbers has it set. For example, 1000 xor 0100 is 1100.
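For illustration, one branch-free way to spell out that xor idea (my own sketch; a compiler will most likely emit much the same code as for the original function):
// x ^ y is zero exactly when x == y; OR-ing the nine 0/1 results avoids the
// short-circuit branches that || introduces.
int eq3_xor(int a, int b, int c, int d, int e, int f){
    return ((a ^ d) == 0) | ((a ^ e) == 0) | ((a ^ f) == 0)
         | ((b ^ d) == 0) | ((b ^ e) == 0) | ((b ^ f) == 0)
         | ((c ^ d) == 0) | ((c ^ e) == 0) | ((c ^ f) == 0);
}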
The code you have will likely cause at least one pipeline flush (from a mispredicted branch), but apart from that it will be OK performance-wise.
I think using SSE is probably worth investigating.
It has been 20 years since I wrote anything like this, and it's not benchmarked, but something like:
#include <stdint.h>
#include <emmintrin.h>
int cmp3(int32_t a, int32_t b, int32_t c, int32_t d, int32_t e, int32_t f){
    // returns -1 if any of a,b,c is eq to any of d,e,f
    // returns 0 if all a,b,c != d,e,f
    int32_t __attribute__ ((aligned(16))) vec1[4];
    int32_t __attribute__ ((aligned(16))) vec2[4];
    int32_t __attribute__ ((aligned(16))) vec3[4];
    int32_t __attribute__ ((aligned(16))) vec4[4];
    int32_t __attribute__ ((aligned(16))) r1[4];
    int32_t __attribute__ ((aligned(16))) r2[4];
    int32_t __attribute__ ((aligned(16))) r3[4];

    // fourth word is DNK
    vec1[0]=a;
    vec1[1]=b;
    vec1[2]=c;
    vec2[0]=vec2[1]=vec2[2]=d;
    vec3[0]=vec3[1]=vec3[2]=e;
    vec4[0]=vec4[1]=vec4[2]=f;

    __m128i v1 = _mm_load_si128((__m128i *)vec1);
    __m128i v2 = _mm_load_si128((__m128i *)vec2);
    __m128i v3 = _mm_load_si128((__m128i *)vec3);
    __m128i v4 = _mm_load_si128((__m128i *)vec4);

    // any(a,b,c) == d?
    __m128i vcmp1 = _mm_cmpeq_epi32(v1, v2);
    // any(a,b,c) == e?
    __m128i vcmp2 = _mm_cmpeq_epi32(v1, v3);
    // any(a,b,c) == f?
    __m128i vcmp3 = _mm_cmpeq_epi32(v1, v4);

    _mm_store_si128((__m128i *)r1, vcmp1);
    _mm_store_si128((__m128i *)r2, vcmp2);
    _mm_store_si128((__m128i *)r3, vcmp3);

    // bit or the first three of each result.
    // might be better with SSE mask, but I don't remember how!
    return r1[0] | r1[1] | r1[2] |
           r2[0] | r2[1] | r2[2] |
           r3[0] | r3[1] | r3[2];
}
If done correctly, SSE with no branches should be 4x to 8x faster.
If your compiler/architecture supports vector extensions (like clang and gcc) you can use something like:
#ifdef __SSE2__
#include <immintrin.h>
#elif defined __ARM_NEON
#include <arm_neon.h>
#elif defined __ALTIVEC__
#include <altivec.h>
//#elif ... TODO more architectures
#endif
static int hastrue128(void *x){
#ifdef __SSE2__
return _mm_movemask_epi8(*(__m128i*)x);
#elif defined __ARM_NEON
return vaddlvq_u8(*(uint8x16_t*)x);
#elif defined __ALTIVEC__
typedef __UINT32_TYPE__ v4si __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
return vec_any_ne(*(v4si*)x,(v4si){0});
#else
int *y = x;
return y[0]|y[1]|y[2]|y[3];
#endif
}
//if inputs will always be aligned to 16 add an aligned attribute
//otherwise ensure they are at least aligned to 4
int cmp3( int* a , int* b ){
typedef __INT32_TYPE__ i32x4 __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
i32x4 x = *(i32x4*)a, cmp, tmp, y0 = y0^y0, y1 = y0, y2 = y0;
//start vectors off at 0 and add the int to each element for optimization
//it adds the int to each element, but since we started it at zero,
//a good compiler (not ICC at -O3) will skip the xor and add and just broadcast/whatever
y0 += b[0];
y1 += b[1];
y2 += b[2];
cmp = x == y0;
tmp = x == y1; //ppc complains if we don't use temps here
cmp |= tmp;
tmp = x == y2;
cmp |= tmp;
//now hack off the end since we only need 3
cmp &= (i32x4){0xffffffff,0xffffffff,0xffffffff,0};
return hastrue128(&cmp);
}
int cmp4( int* a , int* b ){
typedef __INT32_TYPE__ i32x4 __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
i32x4 x = *(i32x4*)a, cmp, tmp, y0 = y0^y0, y1 = y0, y2 = y0, y3 = y0;
y0 += b[0];
y1 += b[1];
y2 += b[2];
y3 += b[3];
cmp = x == y0;
tmp = x == y1; //ppc complains if we don't use temps here
cmp |= tmp;
tmp = x == y2;
cmp |= tmp;
tmp = x == y3;
cmp |= tmp;
return hastrue128(&cmp);
}
On arm64 this compiles to the following branchless code:
cmp3:
ldr q2, [x0]
adrp x2, .LC0
ld1r {v1.4s}, [x1]
ldp w0, w1, [x1, 4]
dup v0.4s, w0
cmeq v1.4s, v2.4s, v1.4s
dup v3.4s, w1
ldr q4, [x2, #:lo12:.LC0]
cmeq v0.4s, v2.4s, v0.4s
cmeq v2.4s, v2.4s, v3.4s
orr v0.16b, v1.16b, v0.16b
orr v0.16b, v0.16b, v2.16b
and v0.16b, v0.16b, v4.16b
uaddlv h0,v0.16b
umov w0, v0.h[0]
uxth w0, w0
ret
cmp4:
ldr q2, [x0]
ldp w2, w0, [x1, 4]
dup v0.4s, w2
ld1r {v1.4s}, [x1]
dup v3.4s, w0
ldr w1, [x1, 12]
dup v4.4s, w1
cmeq v1.4s, v2.4s, v1.4s
cmeq v0.4s, v2.4s, v0.4s
cmeq v3.4s, v2.4s, v3.4s
cmeq v2.4s, v2.4s, v4.4s
orr v0.16b, v1.16b, v0.16b
orr v0.16b, v0.16b, v3.16b
orr v0.16b, v0.16b, v2.16b
uaddlv h0,v0.16b
umov w0, v0.h[0]
uxth w0, w0
ret
And on ICC x86_64 -march=skylake it produces the following branchless code:
cmp3:
vmovdqu xmm2, XMMWORD PTR [rdi] #27.24
vpbroadcastd xmm0, DWORD PTR [rsi] #34.17
vpbroadcastd xmm1, DWORD PTR [4+rsi] #35.17
vpcmpeqd xmm5, xmm2, xmm0 #34.17
vpbroadcastd xmm3, DWORD PTR [8+rsi] #37.16
vpcmpeqd xmm4, xmm2, xmm1 #35.17
vpcmpeqd xmm6, xmm2, xmm3 #37.16
vpor xmm7, xmm4, xmm5 #36.5
vpor xmm8, xmm6, xmm7 #38.5
vpand xmm9, xmm8, XMMWORD PTR __$U0.0.0.2[rip] #40.5
vpmovmskb eax, xmm9 #11.12
ret #41.12
cmp4:
vmovdqu xmm3, XMMWORD PTR [rdi] #46.24
vpbroadcastd xmm0, DWORD PTR [rsi] #51.17
vpbroadcastd xmm1, DWORD PTR [4+rsi] #52.17
vpcmpeqd xmm6, xmm3, xmm0 #51.17
vpbroadcastd xmm2, DWORD PTR [8+rsi] #54.16
vpcmpeqd xmm5, xmm3, xmm1 #52.17
vpbroadcastd xmm4, DWORD PTR [12+rsi] #56.16
vpcmpeqd xmm7, xmm3, xmm2 #54.16
vpor xmm8, xmm5, xmm6 #53.5
vpcmpeqd xmm9, xmm3, xmm4 #56.16
vpor xmm10, xmm7, xmm8 #55.5
vpor xmm11, xmm9, xmm10 #57.5
vpmovmskb eax, xmm11 #11.12
ret
And it even works on ppc64 with AltiVec, though the result is definitely suboptimal:
cmp3:
lwa 10,4(4)
lxvd2x 33,0,3
vspltisw 11,-1
lwa 9,8(4)
vspltisw 12,0
xxpermdi 33,33,33,2
lwa 8,0(4)
stw 10,-32(1)
addi 10,1,-80
stw 9,-16(1)
li 9,32
stw 8,-48(1)
lvewx 0,10,9
li 9,48
xxspltw 32,32,3
lvewx 13,10,9
li 9,64
vcmpequw 0,1,0
lvewx 10,10,9
xxsel 32,44,43,32
xxspltw 42,42,3
xxspltw 45,45,3
vcmpequw 13,1,13
vcmpequw 1,1,10
xxsel 45,44,43,45
xxsel 33,44,43,33
xxlor 32,32,45
xxlor 32,32,33
vsldoi 1,12,11,12
xxland 32,32,33
vcmpequw. 0,0,12
mfcr 3,2
rlwinm 3,3,25,1
cntlzw 3,3
srwi 3,3,5
blr
cmp4:
lwa 10,8(4)
lxvd2x 33,0,3
vspltisw 10,-1
lwa 9,12(4)
vspltisw 11,0
xxpermdi 33,33,33,2
lwa 7,0(4)
lwa 8,4(4)
stw 10,-32(1)
addi 10,1,-96
stw 9,-16(1)
li 9,32
stw 7,-64(1)
stw 8,-48(1)
lvewx 0,10,9
li 9,48
xxspltw 32,32,3
lvewx 13,10,9
li 9,64
xxspltw 45,45,3
vcmpequw 13,1,13
xxsel 44,43,42,45
lvewx 13,10,9
li 9,80
vcmpequw 0,1,0
xxspltw 45,45,3
xxsel 32,43,42,32
vcmpequw 13,1,13
xxlor 32,32,44
xxsel 45,43,42,45
lvewx 12,10,9
xxlor 32,32,45
xxspltw 44,44,3
vcmpequw 1,1,12
xxsel 33,43,42,33
xxlor 32,32,33
vcmpequw. 0,0,11
mfcr 3,2
rlwinm 3,3,25,1
cntlzw 3,3
srwi 3,3,5
blr
As you can see from the generated asm, there is still a little room for improvement, but it will compile on RISC-V, MIPS, PPC and other architecture+compiler combinations that support vector extensions.