In one of our last assignments in Computer Science this term we have to apply strength reduction to some code fragments. Most of them were straightforward, especially when looking at the compiler output. But one of them I am not able to solve, even with the help of the compiler.
Our profs gave us the following hint:
Hint: Inquire how IEEE 754 single-precision floating-point numbers are
represented in memory.
Here is the code snippet: (a is of type double*)
for (int i = 0; i < N; ++i) {
a[i] += i / 5.3;
}
At first I tried to look at the compiler output for this snippet on godbolt. I compiled it without any optimization (note: I copied only the relevant part inside the for loop):
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
movsd xmm1, QWORD PTR [rax]
cvtsi2sd xmm0, DWORD PTR [rbp-4] //division relevant
movsd xmm2, QWORD PTR .LC0[rip] //division relevant
divsd xmm0, xmm2 //division relevant
mov eax, DWORD PTR [rbp-4]
cdqe
lea rdx, [0+rax*8]
mov rax, QWORD PTR [rbp-16]
add rax, rdx
addsd xmm0, xmm1
movsd QWORD PTR [rax], xmm0
and with -O3:
.L2:
pshufd xmm0, xmm2, 238 //division relevant
cvtdq2pd xmm1, xmm2 //division relevant
movupd xmm6, XMMWORD PTR [rax]
add rax, 32
cvtdq2pd xmm0, xmm0 //division relevant
divpd xmm1, xmm3 //division relevant
movupd xmm5, XMMWORD PTR [rax-16]
paddd xmm2, xmm4
divpd xmm0, xmm3 //division relevant
addpd xmm1, xmm6
movups XMMWORD PTR [rax-32], xmm1
addpd xmm0, xmm5
movups XMMWORD PTR [rax-16], xmm0
cmp rax, rbp
jne .L2
I marked the division-related part of the assembly code with comments. But this output does not help me understand how to apply strength reduction to the snippet. (Maybe there are too many optimizations going on to fully understand the output.)
Second, I tried to understand the bit representation of the constant 5.3 as a single-precision float, which is:
0 |10000001|01010011001100110011010
Sign|Exponent|Mantissa
But this does not help me either.
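For completeness, this is how the bit pattern can be printed (a small stand-alone check, assuming float is an IEEE-754 binary32 type):
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main (void)
{
    float x = 5.3f;
    uint32_t bits;
    memcpy (&bits, &x, sizeof bits); // reinterpret the float's bytes as an integer
    printf ("sign=%u exponent=%u mantissa=0x%06x\n",
            (unsigned)(bits >> 31),
            (unsigned)((bits >> 23) & 0xff),
            (unsigned)(bits & 0x7fffff));
    return 0;
}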
If we adopt Wikipedia's definition that
strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations
then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, that the default rounding mode for floating-point computation is round-to-nearest-or-even, and that int is a 32-bit type, we can prove the transformation correct by a simple exhaustive test:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>
int main (void)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
        if (res != ref) {
            printf ("error: i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}
Most modern instances of common processor architectures like x86-64 and ARM64 have hardware support for FMA, such that fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the generated binary. Where hardware support for FMA is lacking, the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.
The basic idea here is that mathematically, division is equivalent to multiplication with the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background including literature references see this earlier question.
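Spelled out, the expression in the code above decomposes into these steps (the same computation written out; the helper name div_by_5p3 is mine):
#include <math.h>

// q0 is the naive result i * (1/5.3). The first fma() computes the residual
// i - 5.3*q0, with the product 5.3*q0 formed exactly inside the FMA; the
// second fma() folds the scaled residual back into q0 as a correction.
static double div_by_5p3 (int i)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    double q0 = i * rcp_5p3;
    double r  = fma (-5.3, q0, (double)i);
    return fma (r, rcp_5p3, q0);
}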
To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.
In comments, Pascal Cuoq points out that there is an alternative algorithm to potentially strength-reduce floating-point division with a compile-time constant divisor, by precomputing the reciprocal of the divisor to more than native precision, specifically as a double-double. For background see N. Brisebarre and J.-M. Muller, "Correctly rounded multiplication by arbitrary precision constants", IEEE Transactions on Computers, 57(2): 162-174, February 2008, which also provides guidance on how to determine whether that transformation is safe for any particular constant. Since the present case is simple, I again used an exhaustive test to show it is safe. Where applicable, this reduces the division down to one FMA plus a multiply:
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <mathimf.h>
int main (void)
{
    const double rcp_5p3_hi =  1.8867924528301888e-1;  //  0x1.826a439f656f2p-3
    const double rcp_5p3_lo = -7.2921377017921457e-18; // -0x1.0d084b1883f6e0p-57
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (i, rcp_5p3_hi, i * rcp_5p3_lo);
        if (res != ref) {
            printf ("i=%2d res=%23.13a ref=%23.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}
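For reference, the low-order part of such a double-double reciprocal can be derived along the following lines (my own sketch, not taken from the paper; it relies on the FMA computing the residual 1 - divisor*hi accurately, which should be re-checked for each constant):
#include <math.h>

// hi is the reciprocal rounded to double; lo captures (most of) its rounding
// error, so that hi + lo represents 1/divisor to roughly twice native precision.
static void split_reciprocal (double divisor, double *hi, double *lo)
{
    *hi = 1.0 / divisor;
    *lo = fma (-divisor, *hi, 1.0) / divisor; // (1 - divisor*hi) / divisor
}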
To cover another aspect: since all values of type int are exactly representable as double (but not as float), it is possible to get rid of the int-to-double conversion that happens in the loop when evaluating i / 5.3 by introducing a floating-point variable that counts from 0.0 to N:
double fp_i = 0;
for (int i = 0; i < N; fp_i += 1, i++)
a[i] += fp_i / 5.3;
However, this kills autovectorization and introduces a chain of dependent floating-point additions. Floating-point addition latency is typically 3 or 4 cycles, so the last iteration will retire after at least (N-1)*3 cycles, even if the CPU could dispatch the instructions in the loop faster. Thankfully, floating-point division is not fully pipelined, and the number of cycles between successive floating-point divisions that an x86 CPU can sustain roughly matches or exceeds the latency of the addition instruction.
This leaves the problem of killed vectorization. It's possible to bring it back by manually unrolling the loop and introducing two independent chains, but with AVX you'd need four chains for full vectorization:
double fp_i0 = 0, fp_i1 = 1;
int i = 0;
for (; i+1 < N; fp_i0 += 2, fp_i1 += 2, i += 2) {
    double t0 = a[i], t1 = a[i+1];
    a[i]   = t0 + fp_i0 / 5.3;
    a[i+1] = t1 + fp_i1 / 5.3;
}
if (i < N)
    a[i] += i / 5.3;
CAVEAT: After a few days I realized that this answer is incorrect in that it ignores the consequence of underflow (to subnormal or to zero) in the computation of o / 5.3. In this case, multiplying the result by a power of two is “exact” but does not produce the result that dividing the larger integer by 5.3 would have.
i / 5.3 only needs to be computed for odd values of i.
For even values of i, you can simply multiply by 2.0 the value of (i/2)/5.3, which was already computed earlier in the loop.
The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.
One way to achieve this is to iterate on all odd numbers o less than N, and after computing o / 5.3 in order to handle index o, to also handle all indexes of the form o * 2**p.
if (N > 0) {
    a[0] += 0.0; // this is needed for strict IEEE 754 compliance lol
    for (int o = 1; o < N; o += 2) {
        double d = o / 5.3;
        int i = o;
        do {
            a[i] += d;
            i += i;
            d += d;
        } while (i < N);
    }
}
Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but still the above is technically a partial answer to the question as phrased.
I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.
As an additional note, the above transformation only replaces about one division out of two, and it does so by making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization can be applied to the starting point your professor suggested, and that is likely to be more beneficial than suppressing half the divisions. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.
I would expect SSE to be faster than not using SSE. Do I need to add some additional compiler flags? Could it be that I am not seeing a speedup because this is integer code and not floating point?
invocation/output
$ make sum2
clang -O3 -msse -msse2 -msse3 -msse4.1 sum2.c ; ./a.out 123
n: 123
SSE Time taken: 0 seconds 124 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
NOSSE Time taken: 0 seconds 115 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
compiler
$ clang --version
Apple LLVM version 9.0.0 (clang-900.0.37)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
sum2.c
#include <stdlib.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>
#ifndef __cplusplus
#include <stdalign.h> // C11 defines _Alignas(). This header defines alignas()
#endif
#define CYCLE_COUNT 10000
// add vector and return resulting value on stack
__attribute__((noinline)) __m128i add_iv(__m128i *a, __m128i *b) {
return _mm_add_epi32(*a,*b);
}
// add int vectors via sse
__attribute__((noinline)) void add_iv_sse(__m128i *a, __m128i *b, __m128i *out, int N) {
for(int i=0; i<N/sizeof(int); i++) {
//out[i]= _mm_add_epi32(a[i], b[i]); // this also works
_mm_storeu_si128(&out[i], _mm_add_epi32(a[i], b[i]));
}
}
// add int vectors without sse
__attribute__((noinline)) void add_iv_nosse(int *a, int *b, int *out, int N) {
for(int i=0; i<N; i++) {
out[i] = a[i] + b[i];
}
}
__attribute__((noinline)) void p128_as_int(__m128i in) {
alignas(16) uint32_t v[4];
_mm_store_si128((__m128i*)v, in);
printf("int: %i %i %i %i\n", v[0], v[1], v[2], v[3]);
}
// print first 4 and last 4 elements of int array
__attribute__((noinline)) void debug_print(int *h) {
printf("vector+vector:begin ");
p128_as_int(* (__m128i*) &h[0] );
printf("vector+vector:end ");
p128_as_int(* (__m128i*) &h[32764] );
}
int main(int argc, char *argv[]) {
int n = atoi (argv[1]);
printf("n: %d\n", n);
// sum: vector + vector, of equal length
int f[32768] __attribute__((aligned(16))) = {0,2,4};
int g[32768] __attribute__((aligned(16))) = {1,3,n};
int h[32768] __attribute__((aligned(16)));
f[32765] = 33; f[32766] = 34; f[32767] = 35;
g[32765] = 31; g[32766] = 32; g[32767] = 33;
// https://stackoverflow.com/questions/459691/best-timing-method-in-c
clock_t start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_sse((__m128i*)f, (__m128i*)g, (__m128i*)h, 32768);
}
int msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf(" SSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
// process intense function again
start = clock();
for(int i=0; i<CYCLE_COUNT; ++i) {
add_iv_nosse(f, g, h, 32768);
}
msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
printf("NOSSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
debug_print(h);
return EXIT_SUCCESS;
}
Look at the asm: clang -O2 or -O3 probably auto-vectorizes add_iv_nosse (with a check for overlap, since you didn't use int * restrict a and so on).
Use -fno-tree-vectorize to disable auto vectorization, without stopping you from using intrinsics. I'd recommend clang -march=native -mno-avx -O3 -fno-tree-vectorize to test what I think you want to test, scalar integer vs. legacy-SSE paddd. (It works in gcc and clang. In clang, AFAIK it's a synonym for the clang-specific -fno-vectorize.)
BTW, timing both in the same executable hurts the first one, because the CPU doesn't ramp up to full turbo right away. You're probably into the timed section of the code before your CPU hits full speed. (So run this a couple of times back-to-back, with for i in {1..10}; do time ./a.out; done.)
On Linux I'd use perf stat -r5 ./a.out to run it 5 times with performance counters (and I'd split it up so one run tested one or the other, so I could look at perf counters for the whole run.)
Code review:
You forgot stdint.h for uint32_t. I had to add that to get it to compile on Godbolt to see the asm. (Assuming clang-5.0 is something like the Apple clang version you're using. IDK if Apple's clang implies a default -mtune= option, but that would make sense because it's only targeting Mac. Also a baseline SSSE3 would make sense for 64-bit on x86-64 OS X.)
You don't need noinline on debug_print. Also, I'd recommend a different name for CYCLE_COUNT. Cycles in this context makes me think of clock cycles, so call it REP_COUNT or REPEATS or whatever.
Putting your arrays on the stack in main is probably fine. You do initialize both input arrays (to mostly zero, but add performance isn't data-dependent).
This is good, because leaving them uninitialized might mean that multiple 4k pages of each array were copy-on-write mapped to the same physical zero page, so you'd get more than the expected number of L1D cache hits.
The SSE2 loop should bottleneck on L2 / L3 cache bandwidth, since the working set is 3 arrays * 32768 ints * 4 B = 384 kiB, about 1.5x the 256 kiB L2 cache in Intel CPUs.
clang might unroll its auto-vectorized loop more than it does your manual intrinsics loop. That might explain the better performance, since 16B vectors (rather than 32B AVX2 vectors) might not saturate cache bandwidth if you're not getting 2 loads + 1 store per clock.
Update: actually the loop overhead is pretty extreme, with 3 pointer increments + a loop counter, and only unrolling by 2 to amortize that.
The auto-vectorized loop:
.LBB2_12: # =>This Inner Loop Header: Depth=1
movdqu xmm0, xmmword ptr [r9 - 16]
movdqu xmm1, xmmword ptr [r9] # hoisted load for 2nd unrolled iter
movdqu xmm2, xmmword ptr [r10 - 16]
paddd xmm2, xmm0
movdqu xmm0, xmmword ptr [r10]
paddd xmm0, xmm1
movdqu xmmword ptr [r11 - 16], xmm2
movdqu xmmword ptr [r11], xmm0
add r9, 32
add r10, 32
add r11, 32
add rbx, -8 # add / jne macro-fused on SnB-family CPUs
jne .LBB2_12
So it's 12 fused-domain uops, and can run at best 2 vectors per 3 clocks, bottlenecked on the front-end issue bandwidth of 4 uops per clock.
It's not using aligned loads because the compiler doesn't have that info without inlining into main where the alignment is known, and you didn't guarantee alignment with p = __builtin_assume_aligned(p, 16) or anything in the stand-alone function. Aligned loads (or AVX) would let paddd use a memory operand instead of a separate movdqu load.
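For example, a sketch (untested) of how the stand-alone function could promise both 16-byte alignment and non-overlap, so the auto-vectorizer can use aligned memory operands without a runtime check (the function name is mine):
__attribute__((noinline)) void add_iv_nosse_aligned(int *restrict a, int *restrict b, int *restrict out, int N) {
    // tell the compiler the pointers are 16-byte aligned; restrict rules out overlap
    a   = __builtin_assume_aligned(a, 16);
    b   = __builtin_assume_aligned(b, 16);
    out = __builtin_assume_aligned(out, 16);
    for(int i=0; i<N; i++) {
        out[i] = a[i] + b[i];
    }
}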
The manually-vectorized loop uses aligned loads to save front-end uops, but has more loop overhead from the loop counter.
.LBB1_7: # =>This Inner Loop Header: Depth=1
movdqa xmm0, xmmword ptr [rcx - 16]
paddd xmm0, xmmword ptr [rax - 16]
movdqu xmmword ptr [r11 - 16], xmm0
movdqa xmm0, xmmword ptr [rcx]
paddd xmm0, xmmword ptr [rax]
movdqu xmmword ptr [r11], xmm0
add r10, 2 # separate loop counter
add r11, 32 # 3 pointer increments
add rax, 32
add rcx, 32
cmp r9, r10 # compare the loop counter
jne .LBB1_7
So it's 11 fused-domain uops. It should be running faster than the auto-vectorized loop. Your timing method probably caused the problem.
(Unless mixing loads and stores is actually making it less optimal. The auto-vectorized loop did 4 loads and then 2 stores. Actually that might explain it. Your arrays are a multiple of 4kiB, and might all have the same relative alignment. So you might be getting 4k aliasing here, which means the CPU isn't sure that a store doesn't overlap a load. I think there's a performance counter you can check for that.)
See also Agner Fog's microarch guide (and instruction tables + optimization guide), and other links in the x86 tag wiki, especially Intel's optimization guide.
There's also some good SSE/SIMD beginner stuff in the sse tag wiki.
I have implemented a program using SSE2 in order to compare the AVX2 vpsadbw instruction with the SSE2 psadbw instruction. The following code is the SSE2 program:
#include <stdio.h>
#include <time.h>
#include <emmintrin.h>

#define MAX1 4096
#define MAX2 MAX1
#define MAX3 MAX1
#define NUM_LOOP 1000000000
double pTime = 0, mTime = 5;
//global data for sequential matrix operations
unsigned char a_char[MAX1][MAX2] __attribute__(( aligned(16)));
unsigned char b_char[MAX2][MAX3] __attribute__(( aligned(16)));
unsigned char c_char[MAX1][MAX3] __attribute__(( aligned(16)));
unsigned short int temp[8];
int main()
{
int i, j, w=0, sad=0;
struct timespec tStart, tEnd;
double tTotal , tBest=10000;
__m128i vec1, vec2, vecT, sad_total;
sad_total= _mm_setzero_si128();
do{
clock_gettime(CLOCK_MONOTONIC,&tStart);
for(i=0; i<MAX1; i++){
for(j=0; j<MAX2; j+=16){
vec1 = _mm_load_si128((__m128i *)&a_char[i][j]);
vec2 = _mm_load_si128((__m128i *)&b_char[i][j]);
vecT = _mm_sad_epu8( vec1 , vec2);
sad_total = _mm_add_epi64(vecT, sad_total);
}
}
_mm_store_si128((__m128i *)&temp[0], sad_total);
sad=temp[0]+temp[2]+temp[4]+temp[6];
clock_gettime(CLOCK_MONOTONIC,&tEnd);
tTotal = (tEnd.tv_sec - tStart.tv_sec);
tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
if(tTotal<tBest)
tBest=tTotal;
pTime += tTotal;
} while(w++ < NUM_LOOP && pTime < mTime);
printf(" The best time: %lf sec in %d repetition for %dX result is %d matrix\n",tBest,w, MAX1, sad);
return 0;
}
I use gcc on Skylake, running Linux Mint.
When I generate the assembly code, the inner loop contains some unwanted move operations for SSE2, as follows:
.L26:
vmovdqa xmm1, XMMWORD PTR a_char[rcx+rax]
vpsadbw xmm1, xmm1, XMMWORD PTR b_char[rcx+rax]
add rax, 16
vpaddq xmm3, xmm1, XMMWORD PTR [rsp]
cmp rax, 4096
vmovaps XMMWORD PTR [rsp], xmm3
jne .L26
The AVX2 version, by contrast, generates this assembly code:
.L26:
vmovdqa ymm1, YMMWORD PTR a_char[rcx+rax]
vpsadbw ymm1, ymm1, YMMWORD PTR b_char[rcx+rax]
add rax, 32
vpaddq ymm2, ymm2, ymm1
cmp rax, 4096
jne .L26
I don't know the reason for those two move instructions, which hurt the performance significantly.
The reason is this:
_mm_store_si128((__m128i *)&temp[0], sad_total);
Clang doesn't mind and makes nice code regardless, but GCC didn't like it (failed heuristics perhaps?)
With that replaced by something that doesn't trigger the "this should be on the stack all the time" heuristic, GCC makes nicer code, for example (not tested):
__m128i sad_total = _mm_setzero_si128();
for(i = 0; i < MAX1; i++) {
    for(j = 0; j < MAX2; j += 16) {
        __m128i vec1 = _mm_load_si128((__m128i *)&a_char[i][j]);
        __m128i vec2 = _mm_load_si128((__m128i *)&b_char[i][j]);
        __m128i vecT = _mm_sad_epu8(vec1, vec2);
        sad_total = _mm_add_epi64(sad_total, vecT);
    }
}
__m128i hsum = _mm_add_epi64(sad_total, _mm_bsrli_si128(sad_total, 8));
sad = _mm_cvtsi128_si32(hsum);
The inner loop now looks like
.L2:
vmovdqa xmm1, XMMWORD PTR a_char[rdx+rax]
vpsadbw xmm1, xmm1, XMMWORD PTR b_char[rdx+rax]
add rax, 16
vpaddq xmm2, xmm1, xmm2
cmp rax, 4096
jne .L2
You're directly bypassing the compiler and telling it to use movdqa via _mm_load_si128; it's doing exactly what you're telling it to do, so what is the problem here? I also noticed that you're aligning on a 16-byte boundary. Feel free to correct me if I'm wrong (I'm not sure how the attribute is implemented on your compiler), but you may get padding as a result, so that each element is aligned on a 16-byte boundary; if so, this will affect the impact of your unrolling.
int eq3(int a, int b, int c, int d, int e, int f){
return a == d || a == e || a == f
|| b == d || b == e || b == f
|| c == d || c == e || c == f;
}
This function receives 6 ints and returns true if any of the first 3 ints is equal to any of the last 3 ints. Is there any bitwise-hack-style way to make it faster?
Assuming you're expecting a high rate of false results you could make a quick "pre-check" to quickly isolate such cases:
If a bit in a is set that isn't set in any of d, e and f then a cannot be equal to any of these.
Thus something like
int pre_eq3(int a, int b, int c, int d, int e, int f){
    int const mask = ~(d | e | f);
    if ((a & mask) && (b & mask) && (c & mask)) {
        return false;
    }
    return eq3(a, b, c, d, e, f);
}
could speed it up (8 operations instead of 17, but much more costly if the result will actually be true). If mask == 0 then of course this won't help.
This can be further improved if, with high probability, a & b & c has some bits set:
int pre_eq3(int a, int b, int c, int d, int e, int f){
    int const mask = ~(d | e | f);
    if ((a & b & c) & mask) {
        return false;
    }
    if ((a & mask) && (b & mask) && (c & mask)) {
        return false;
    }
    return eq3(a, b, c, d, e, f);
}
Now if all of a, b and c have bits set where none of d, e and f have any bits set, we're out pretty fast.
Expanding on dawg's SSE comparison method, you can combine the results of the comparisons using a vector OR, and move a mask of the compare results back to an integer to test for 0 / non-zero.
Also, you can get data into vectors more efficiently (although it's still pretty clunky to get many separate integers into vectors when they're live in registers to start with, rather than sitting in memory).
You should avoid store-forwarding stalls that result from doing three small stores and one big load.
///// UNTESTED ////////
#include <immintrin.h>
int eq3(int a, int b, int c, int d, int e, int f){
    // Use _mm_set to let the compiler worry about getting integers into vectors
    // Use -mtune=intel or gcc will make bad code, though :(
    __m128i abcc = _mm_set_epi32(0,c,b,a); // args go from high to low position in the vector
    // masking off the high bits of the result-mask to avoid false positives
    // is cheaper than repeating c (to do the same compare twice)

    __m128i dddd = _mm_set1_epi32(d);
    __m128i eeee = _mm_set1_epi32(e);
    dddd = _mm_cmpeq_epi32(dddd, abcc);
    eeee = _mm_cmpeq_epi32(eeee, abcc); // per element: 0(unequal) or -1(equal)
    __m128i combined = _mm_or_si128(dddd, eeee);

    __m128i ffff = _mm_set1_epi32(f);
    ffff = _mm_cmpeq_epi32(ffff, abcc);
    combined = _mm_or_si128(combined, ffff);

    // results of all the compares are ORed together.  All zero only if there were no hits
    unsigned equal_mask = _mm_movemask_epi8(combined);
    equal_mask &= 0x0fff; // the high 32b element could have false positives
    return equal_mask;
    // return !!equal_mask if you want to force it to 0 or 1
    // the mask tells you whether it was a, b, or c that had a hit

    // movmskps would return a mask of just 4 bits, one for each 32b element, but might have a bypass delay on Nehalem.
    // actually, pmovmskb apparently runs in the float domain on Nehalem anyway, according to Agner Fog's table >.<
}
This compiles to pretty nice asm, pretty similar between clang and gcc, but clang's -fverbose-asm puts nice comments on the shuffles. Only 19 instructions including the ret, with a decent amount of parallelism from separate dependency chains. With -msse4.1, or -mavx, it saves another couple of instructions. (But probably doesn't run any faster)
With clang, dawg's version is about twice the size. With gcc, something bad happens and it's horrible (over 80 instructions. Looks like a gcc optimization bug, since it looks worse than just a straightforward translation of the source). Even clang's version spends so long getting data into / out of vector regs that it might be faster to just do the comparisons branchlessly and OR the truth values together.
This compiles to decent code:
// 8bit variable doesn't help gcc avoid partial-register stalls even with -mtune=core2 :/
int eq3_scalar(int a, int b, int c, int d, int e, int f){
    char retval = (a == d) | (a == e) | (a == f)
                | (b == d) | (b == e) | (b == f)
                | (c == d) | (c == e) | (c == f);
    return retval;
}
Play around with how to get the data from the caller into vector regs.
If the groups of three are coming from memory, then prob. passing pointers so a vector load can get them from their original location is best. Going through integer registers on the way to vectors sucks (higher latency, more uops), but if your data is already live in regs it's a loss to do integer stores and then vector loads. gcc is dumb and follows the AMD optimization guide's recommendation to bounce through memory, even though Agner Fog says he's found that's not worth it even on AMD CPUs. It's definitely worse on Intel, and apparently a wash or maybe still worse on AMD, so it's definitely the wrong choice for -mtune=generic. Anyway...
It's also possible to do 8 of our 9 compares with just two packed-vector compares.
The 9th can be done with an integer compare, and have its truth value ORed with the vector result. On some CPUs (esp. AMD, and maybe Intel Haswell and later) not transferring one of the 6 integers to vector regs at all might be a win. Mixing three integer branchless-compares in with the vector shuffles / compares would interleave them nicely.
These vector comparisons can be set up by using shufps on integer data (since it can combine data from two source registers). That's fine on most CPUs, but requires a lot of annoying casting when using intrinsics instead of actual asm. Even if there is a bypass delay, it's not a bad tradeoff vs. something like punpckldq and then pshufd.
aabb    ccab
====    ====
dede    deff
c==f
with asm something like:
#### untested
# pretend a is in eax, and so on
movd xmm0, eax
movd xmm1, ebx
movd xmm2, ecx
shl rdx, 32
#mov edi, edi # zero the upper 32 of rdi if needed, or use shld instead of OR if you don't care about AMD CPUs
or rdx, rdi # de in an integer register.
movq xmm3, rdx # de, aka (d<<32)|e
# in 32bit code, use a vector shuffle of some sort to do this in a vector reg, or:
#pinsrd xmm3, edi, 1 # SSE4.1, and 2 uops (same as movd+shuffle)
#movd xmm4, edi # e
movd xmm5, esi # f
shufps xmm0, xmm1, 0 # xmm0=aabb (low dword = a; my notation is backwards from left/right vector-shift perspective)
shufps xmm5, xmm3, 0b01000000 # xmm5 = ffde
punpcklqdq xmm3, xmm3 # broadcast: xmm3=dede
pcmpeqd xmm3, xmm0 # xmm3: aabb == dede
# spread these instructions out between vector instructions, if you aren't branching
xor edx,edx
cmp esi, ecx # c == f
#je .found_match # if there's one of the 9 that's true more often, make it this one. Branch mispredicts suck, though
sete dl
shufps xmm0, xmm2, 0b00001000 # xmm0 = abcc
pcmpeqd xmm0, xmm5 # abcc == ffde
por xmm0, xmm3
pmovmskb eax, xmm0 # will have bits set if cmpeq found any equal elements
or eax, edx # combine vector and scalar compares
jnz .found_match
# or record the result instead of branching on it
setnz dl
This is also 19 instructions (not counting the final jcc / setcc), but one of them is an xor-zeroing idiom, and there are other simple integer instructions. (Shorter encoding, some can run on port6 on Haswell+ which can't handle vector instructions). There might be a longer dep chain due to the chain of shuffles that builds abcc.
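For reference, one of those shufps steps written with intrinsics looks something like this (an untested sketch; va and vb are hypothetical __m128i variables already holding a and b, e.g. from _mm_cvtsi32_si128, and the casts are exactly the annoying part mentioned above):
#include <emmintrin.h>

// aabb = [a, a, b, b]: shufps takes its low two elements from the first source
// and its high two from the second; the casts just reinterpret the registers.
static inline __m128i shuffle_aabb(__m128i va, __m128i vb) {
    return _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(va), _mm_castsi128_ps(vb),
                       _MM_SHUFFLE(0, 0, 0, 0)));
}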
If you want a bitwise version, look to xor. If you xor two numbers that are the same, the result will be 0. Otherwise, a bit will be set wherever exactly one of the two numbers has it set. For example, 1000 xor 0100 is 1100.
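For illustration, one branch-free way to spell out that xor idea (my own sketch; a compiler will most likely emit much the same code as for the original function):
// x ^ y is zero exactly when x == y; OR-ing the nine 0/1 results avoids the
// short-circuit branches that || introduces.
int eq3_xor(int a, int b, int c, int d, int e, int f){
    return ((a ^ d) == 0) | ((a ^ e) == 0) | ((a ^ f) == 0)
         | ((b ^ d) == 0) | ((b ^ e) == 0) | ((b ^ f) == 0)
         | ((c ^ d) == 0) | ((c ^ e) == 0) | ((c ^ f) == 0);
}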
The code you have will likely cause at least one pipeline flush (from a mispredicted branch), but apart from that it will be OK performance-wise.
I think using SSE is probably worth investigating.
It has been 20 years since I wrote anything like this, and it's not benchmarked, but something like:
#include <stdint.h>
#include <emmintrin.h>
int cmp3(int32_t a, int32_t b, int32_t c, int32_t d, int32_t e, int32_t f){
    // returns -1 if any of a,b,c is eq to any of d,e,f
    // returns 0 if all a,b,c != d,e,f
    int32_t __attribute__ ((aligned(16))) vec1[4];
    int32_t __attribute__ ((aligned(16))) vec2[4];
    int32_t __attribute__ ((aligned(16))) vec3[4];
    int32_t __attribute__ ((aligned(16))) vec4[4];
    int32_t __attribute__ ((aligned(16))) r1[4];
    int32_t __attribute__ ((aligned(16))) r2[4];
    int32_t __attribute__ ((aligned(16))) r3[4];

    // fourth word is DNK
    vec1[0]=a;
    vec1[1]=b;
    vec1[2]=c;
    vec2[0]=vec2[1]=vec2[2]=d;
    vec3[0]=vec3[1]=vec3[2]=e;
    vec4[0]=vec4[1]=vec4[2]=f;

    __m128i v1 = _mm_load_si128((__m128i *)vec1);
    __m128i v2 = _mm_load_si128((__m128i *)vec2);
    __m128i v3 = _mm_load_si128((__m128i *)vec3);
    __m128i v4 = _mm_load_si128((__m128i *)vec4);

    // any(a,b,c) == d?
    __m128i vcmp1 = _mm_cmpeq_epi32(v1, v2);
    // any(a,b,c) == e?
    __m128i vcmp2 = _mm_cmpeq_epi32(v1, v3);
    // any(a,b,c) == f?
    __m128i vcmp3 = _mm_cmpeq_epi32(v1, v4);

    _mm_store_si128((__m128i *)r1, vcmp1);
    _mm_store_si128((__m128i *)r2, vcmp2);
    _mm_store_si128((__m128i *)r3, vcmp3);

    // bit or the first three of each result.
    // might be better with SSE mask, but I don't remember how!
    return r1[0] | r1[1] | r1[2] |
           r2[0] | r2[1] | r2[2] |
           r3[0] | r3[1] | r3[2];
}
If done correctly, SSE with no branches should be 4x to 8x faster.
If your compiler/architecture supports vector extensions (like clang and gcc) you can use something like:
#ifdef __SSE2__
#include <immintrin.h>
#elif defined __ARM_NEON
#include <arm_neon.h>
#elif defined __ALTIVEC__
#include <altivec.h>
//#elif ... TODO more architectures
#endif
static int hastrue128(void *x){
#ifdef __SSE2__
return _mm_movemask_epi8(*(__m128i*)x);
#elif defined __ARM_NEON
return vaddlvq_u8(*(uint8x16_t*)x);
#elif defined __ALTIVEC__
typedef __UINT32_TYPE__ v4si __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
return vec_any_ne(*(v4si*)x,(v4si){0});
#else
int *y = x;
return y[0]|y[1]|y[2]|y[3];
#endif
}
//if inputs will always be aligned to 16 add an aligned attribute
//otherwise ensure they are at least aligned to 4
int cmp3( int* a , int* b ){
typedef __INT32_TYPE__ i32x4 __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
i32x4 x = *(i32x4*)a, cmp, tmp, y0 = y0^y0, y1 = y0, y2 = y0;
//start vectors off at 0 and add the int to each element for optimization
//it adds the int to each element, but since we started it at zero,
//a good compiler (not ICC at -O3) will skip the xor and add and just broadcast/whatever
y0 += b[0];
y1 += b[1];
y2 += b[2];
cmp = x == y0;
tmp = x == y1; //ppc complains if we don't use temps here
cmp |= tmp;
tmp = x == y2;
cmp |= tmp;
//now hack off the end since we only need 3
cmp &= (i32x4){0xffffffff,0xffffffff,0xffffffff,0};
return hastrue128(&cmp);
}
int cmp4( int* a , int* b ){
typedef __INT32_TYPE__ i32x4 __attribute__ ((__vector_size__ (16), aligned(4), __may_alias__));
i32x4 x = *(i32x4*)a, cmp, tmp, y0 = y0^y0, y1 = y0, y2 = y0, y3 = y0;
y0 += b[0];
y1 += b[1];
y2 += b[2];
y3 += b[3];
cmp = x == y0;
tmp = x == y1; //ppc complains if we don't use temps here
cmp |= tmp;
tmp = x == y2;
cmp |= tmp;
tmp = x == y3;
cmp |= tmp;
return hastrue128(&cmp);
}
On arm64 this compiles to the following branchless code:
cmp3:
ldr q2, [x0]
adrp x2, .LC0
ld1r {v1.4s}, [x1]
ldp w0, w1, [x1, 4]
dup v0.4s, w0
cmeq v1.4s, v2.4s, v1.4s
dup v3.4s, w1
ldr q4, [x2, #:lo12:.LC0]
cmeq v0.4s, v2.4s, v0.4s
cmeq v2.4s, v2.4s, v3.4s
orr v0.16b, v1.16b, v0.16b
orr v0.16b, v0.16b, v2.16b
and v0.16b, v0.16b, v4.16b
uaddlv h0,v0.16b
umov w0, v0.h[0]
uxth w0, w0
ret
cmp4:
ldr q2, [x0]
ldp w2, w0, [x1, 4]
dup v0.4s, w2
ld1r {v1.4s}, [x1]
dup v3.4s, w0
ldr w1, [x1, 12]
dup v4.4s, w1
cmeq v1.4s, v2.4s, v1.4s
cmeq v0.4s, v2.4s, v0.4s
cmeq v3.4s, v2.4s, v3.4s
cmeq v2.4s, v2.4s, v4.4s
orr v0.16b, v1.16b, v0.16b
orr v0.16b, v0.16b, v3.16b
orr v0.16b, v0.16b, v2.16b
uaddlv h0,v0.16b
umov w0, v0.h[0]
uxth w0, w0
ret
And on ICC x86_64 -march=skylake it produces the following branchless code:
cmp3:
vmovdqu xmm2, XMMWORD PTR [rdi] #27.24
vpbroadcastd xmm0, DWORD PTR [rsi] #34.17
vpbroadcastd xmm1, DWORD PTR [4+rsi] #35.17
vpcmpeqd xmm5, xmm2, xmm0 #34.17
vpbroadcastd xmm3, DWORD PTR [8+rsi] #37.16
vpcmpeqd xmm4, xmm2, xmm1 #35.17
vpcmpeqd xmm6, xmm2, xmm3 #37.16
vpor xmm7, xmm4, xmm5 #36.5
vpor xmm8, xmm6, xmm7 #38.5
vpand xmm9, xmm8, XMMWORD PTR __$U0.0.0.2[rip] #40.5
vpmovmskb eax, xmm9 #11.12
ret #41.12
cmp4:
vmovdqu xmm3, XMMWORD PTR [rdi] #46.24
vpbroadcastd xmm0, DWORD PTR [rsi] #51.17
vpbroadcastd xmm1, DWORD PTR [4+rsi] #52.17
vpcmpeqd xmm6, xmm3, xmm0 #51.17
vpbroadcastd xmm2, DWORD PTR [8+rsi] #54.16
vpcmpeqd xmm5, xmm3, xmm1 #52.17
vpbroadcastd xmm4, DWORD PTR [12+rsi] #56.16
vpcmpeqd xmm7, xmm3, xmm2 #54.16
vpor xmm8, xmm5, xmm6 #53.5
vpcmpeqd xmm9, xmm3, xmm4 #56.16
vpor xmm10, xmm7, xmm8 #55.5
vpor xmm11, xmm9, xmm10 #57.5
vpmovmskb eax, xmm11 #11.12
ret
And it even works on ppc64 with AltiVec, though the result is definitely suboptimal:
cmp3:
lwa 10,4(4)
lxvd2x 33,0,3
vspltisw 11,-1
lwa 9,8(4)
vspltisw 12,0
xxpermdi 33,33,33,2
lwa 8,0(4)
stw 10,-32(1)
addi 10,1,-80
stw 9,-16(1)
li 9,32
stw 8,-48(1)
lvewx 0,10,9
li 9,48
xxspltw 32,32,3
lvewx 13,10,9
li 9,64
vcmpequw 0,1,0
lvewx 10,10,9
xxsel 32,44,43,32
xxspltw 42,42,3
xxspltw 45,45,3
vcmpequw 13,1,13
vcmpequw 1,1,10
xxsel 45,44,43,45
xxsel 33,44,43,33
xxlor 32,32,45
xxlor 32,32,33
vsldoi 1,12,11,12
xxland 32,32,33
vcmpequw. 0,0,12
mfcr 3,2
rlwinm 3,3,25,1
cntlzw 3,3
srwi 3,3,5
blr
cmp4:
lwa 10,8(4)
lxvd2x 33,0,3
vspltisw 10,-1
lwa 9,12(4)
vspltisw 11,0
xxpermdi 33,33,33,2
lwa 7,0(4)
lwa 8,4(4)
stw 10,-32(1)
addi 10,1,-96
stw 9,-16(1)
li 9,32
stw 7,-64(1)
stw 8,-48(1)
lvewx 0,10,9
li 9,48
xxspltw 32,32,3
lvewx 13,10,9
li 9,64
xxspltw 45,45,3
vcmpequw 13,1,13
xxsel 44,43,42,45
lvewx 13,10,9
li 9,80
vcmpequw 0,1,0
xxspltw 45,45,3
xxsel 32,43,42,32
vcmpequw 13,1,13
xxlor 32,32,44
xxsel 45,43,42,45
lvewx 12,10,9
xxlor 32,32,45
xxspltw 44,44,3
vcmpequw 1,1,12
xxsel 33,43,42,33
xxlor 32,32,33
vcmpequw. 0,0,11
mfcr 3,2
rlwinm 3,3,25,1
cntlzw 3,3
srwi 3,3,5
blr
As you can see from the generated asm, there is still a little room for improvement, but it will compile on RISC-V, MIPS, PPC and other architecture+compiler combinations that support vector extensions.