Using ARM NEON is slower in a simple addition task

I tried to write a simple piece of NEON code but found it is slower than the regular C++ implementation. The code is below:
#include <arm_neon.h>

float A[] = {1, 2, 3, 4};
float B[] = {2, 3, 4, 5};
float32x4_t v1;
float32x4_t v2;
float32x4_t v;
int counter = 1000000;
while (counter--) {
    v1 = vld1q_f32(A);
    v2 = vld1q_f32(B);
    v = vaddq_f32(v1, v2);
    vst1q_f32(A, v);
    // A[0] = A[0] + B[0]; // regular implementation
    // A[1] = A[1] + B[1]; // regular implementation
    // A[2] = A[2] + B[2]; // regular implementation
    // A[3] = A[3] + B[3]; // regular implementation
}
I searched for the reason, and my guess is that the in-order pipeline and this very simple task cause stalls in the CPU. Could anyone explain in more detail? Is there any way to improve the performance of this NEON implementation, or is it better to use the regular implementation rather than NEON for this kind of simple task? Thank you.

Your test routine is completely flawed to start with:
Since all the inputs are clearly visible to the compiler at build time, it will simply generate machine code similar to the following:
A[0] = 3.0f;
A[1] = 5.0f;
A[2] = 7.0f;
A[3] = 9.0f;
In order to prevent compilers from this kind of cheating, you have to hide the inputs:
#include <arm_neon.h>
#include <stdint.h>

void myFunc_c(float *pA, float *pB, uint32_t count)
{
    if (count == 0) return;
    do {
        *pA++ += *pB++;
    } while (--count);
}

void myFunc_neon(float *pA, float *pB, uint32_t count)
{
    float32x4_t a, b;
    count >>= 2;
    if (count == 0) return;
    do {
        a = vld1q_f32(pA);
        b = vld1q_f32(pB);
        a = vaddq_f32(a, b);
        vst1q_f32(pA, a);
        pA += 4;
        pB += 4;
    } while (--count);
}
All you need to do is allocate enough memory for pA and pB, initialize it if you want, and call the functions above.
I think the NEON version will be roughly 3 times faster.
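For reference, a minimal driver sketch might look like the following (untested; the buffer size, the initial values, and whatever timing harness you put around the calls are up to you):
#include <arm_neon.h>
#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    const uint32_t count = 1 << 20;                 // number of floats; keep it a multiple of 4
    float *pA = (float *)malloc(count * sizeof(float));
    float *pB = (float *)malloc(count * sizeof(float));
    if (!pA || !pB) return 1;

    for (uint32_t i = 0; i < count; i++) {          // initialize the buffers with some values
        pA[i] = (float)i;
        pB[i] = (float)(count - i);
    }

    myFunc_neon(pA, pB, count);                     // or myFunc_c(pA, pB, count);

    free(pA);
    free(pB);
    return 0;
}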

If you use intrinsics for load/store the compiler seems to leave them alone.
Without intrinsics it can optimize away the load/store and use just registers for intermediate values. Have a look at the generated assembly: https://www.godbolt.org/z/FvqKP3
In addition, I would guess that loading something into NEON while the same memory is still being written by another transfer is such a rare use case that the CPU designers wouldn't have bothered to implement a direct bypass. You may have to wait until the store has fully finished before the load can start.

Related

How to add all int32 elements in a lane using NEON intrinsics

Here is my code for adding all int16x4 elements in a lane:
#include <arm_neon.h>
...
int16x4_t acc = vdup_n_s16(1);
int32x2_t acc1;
int64x1_t acc2;
int32_t sum;
acc1 = vpaddl_s16(acc);
acc2 = vpaddl_s32(acc1);
sum = (int)vget_lane_s64(acc2, 0);
printf("%d\n", sum);// 4
And I tried to add all int32x4 elements in a lane, but my code looks inefficient:
#include <arm_neon.h>
...
int32x4_t accl = vdupq_n_s32(1);
int64x2_t accl_1;
int64_t temp;
int64_t temp2;
int32_t sum1;
accl_1 = vpaddlq_s32(accl);
temp = (int)vgetq_lane_s64(accl_1, 0);
temp2 = (int)vgetq_lane_s64(accl_1, 1);
sum1 = temp + temp2;
printf("%d\n", sum1); // 4
Is there a simpler and clearer way to do this? I hope the LLVM assembly code is simple and clean after compiling it, and I also want the final type of sum to be 32 bits.
I used the ELLCC cross-compiler, which is based on the LLVM compiler infrastructure, to compile it.
I saw a similar question (Add all elements in a lane) on Stack Overflow, but the addv intrinsic doesn't work on my host.
If you only want a 32-bit result, presumably either intermediate overflow is unlikely or you simply don't care about it, in which case you could just stay 32-bit all the way:
int32x2_t temp = vadd_s32(vget_high_s32(accl), vget_low_s32(accl));
int32x2_t temp2 = vpadd_s32(temp, temp);
int32_t sum1 = vget_lane_s32(temp2, 0);
However, using 64-bit accumulation isn't actually any more hassle, and can also be done without dropping out of NEON - it's just a different order of operations:
int64x2_t temp = vpaddlq_s32(accl);
int64x1_t temp2 = vadd_s64(vget_high_s64(temp), vget_low_s64(temp));
int32_t sum1 = vget_lane_s32(vreinterpret_s32_s64(temp2), 0);
Either of those boils down to just 3 NEON instructions and no scalar arithmetic. The crucial trick on 32-bit ARM is that a pairwise add of two halves of a Q register is simply a normal add of two D registers - that doesn't apply to AArch64 where the SIMD register layout is different, but then AArch64 has the aforementioned horizontal addv anyway.
Now, how horrible any of that looks in LLVM IR I don't know - I suppose it depends on how it treats vector types and operations internally - but in terms of the final ARM machine code both could be considered optimal.
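As a footnote, if you are building for AArch64, the horizontal add mentioned above is exposed directly as an intrinsic, so the whole reduction collapses to a single call (this intrinsic does not exist for 32-bit ARM):
#include <arm_neon.h>

// AArch64 only: vaddvq_s32 maps to the ADDV instruction and sums all four lanes.
int32_t sum_lanes_aarch64(int32x4_t v)
{
    return vaddvq_s32(v);
}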

SSE parallelization

Hi, I am trying to improve the performance of this code, supposing that I have a machine capable of handling 4 threads. I first thought about using omp parallel, but then I saw that this function is inside a for loop, so creating threads that many times would not be very efficient. So I would like to know how to implement it with SSE, which would be more efficient:
unsigned char cubicInterpolate_paralelo(unsigned char p[4], unsigned char x) {
    unsigned char resultado;
    unsigned char intermedio;
    intermedio = x * (3.0 * (p[1] - p[2]) + p[3] - p[0]);
    resultado = p[1] + 0.5 * x * (p[2] - p[0] + x * (2.0 * p[0] - 5.0 * p[1] + 4.0 * p[2] - p[3] + x * (3.0 * (p[1] - p[2]) + p[3] - p[0])));
    return resultado;
}

unsigned char bicubicInterpolate_paralelo(unsigned char p[4][4], unsigned char x, unsigned char y) {
    unsigned char arr[4], valorPixelCanal;
    arr[0] = cubicInterpolate_paralelo(p[0], y);
    arr[1] = cubicInterpolate_paralelo(p[1], y);
    arr[2] = cubicInterpolate_paralelo(p[2], y);
    arr[3] = cubicInterpolate_paralelo(p[3], y);
    valorPixelCanal = cubicInterpolate_paralelo(arr, x);
    return valorPixelCanal;
}
This is used inside some nested for loops:
for (i = 0; i < z_img.width(); i++) {
    for (j = 0; j < z_img.height(); j++) {
        // For R, G, B
        for (c = 0; c < 3; c++) {
            for (l = 0; l < 4; l++) {
                for (k = 0; k < 4; k++) {
                    arr[l][k] = img(i / zFactor + l, j / zFactor + k, 0, c);
                }
            }
            color[c] = bicubicInterpolate_paralelo(arr, (unsigned char)(i % zFactor) / zFactor, (unsigned char)(j % zFactor) / zFactor);
        }
        z_img.draw_point(i, j, color);
    }
}
I've taken some liberties with the code, so you may have to change it significantly, but here's an (untested) transliteration to SSE:
__m128i x = _mm_unpacklo_epi8(_mm_loadl_epi64(x_array), _mm_setzero_si128());
__m128i p0 = _mm_unpacklo_epi8(_mm_loadl_epi64(p0_array), _mm_setzero_si128());
__m128i p1 = _mm_unpacklo_epi8(_mm_loadl_epi64(p1_array), _mm_setzero_si128());
__m128i p2 = _mm_unpacklo_epi8(_mm_loadl_epi64(p2_array), _mm_setzero_si128());
__m128i p3 = _mm_unpacklo_epi8(_mm_loadl_epi64(p3_array), _mm_setzero_si128());
__m128i t = _mm_sub_epi16(p1, p2);
t = _mm_add_epi16(_mm_add_epi16(t, t), t); // 3 * (p[1] - p[2])
__m128i intermedio = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t, p3), p0));
t = _mm_add_epi16(p1, _mm_slli_epi16(p1, 2)); // 5 * p[1]
// t2 = 2 * p[0] + 4 * p[2]
__m128i t2 = _mm_add_epi16(_mm_add_epi16(p0, p0), _mm_slli_epi16(p2, 2));
t = _mm_mullo_epi16(x, _mm_sub_epi16(_mm_add_epi16(t2, intermedio), _mm_add_epi16(t, p3)));
t = _mm_mullo_epi16(x, _mm_add_epi16(_mm_sub_epi16(p2, p0), t));
__m128i resultado = _mm_add_epi16(p1, _mm_srli_epi16(t, 1));
return resultado;
The 16-bit intermediates that I use should be wide enough: the only way for information from the high bits to affect the low bits in this code is the right shift by 1 (the 0.5 * in your code), so really we only need 9 bits; the rest cannot affect the result. Bytes would not be wide enough (unless you have some extra guarantees that I don't know about), and they would be annoying anyway because there is no nice way to multiply them.
For simplicity I pretended that the input takes the form of contiguous arrays of x's, p[0]'s, etc. That's not quite what you need here, but I didn't have time to work out all the loading and shuffling.
SSE is quite unrelated to threads. A single thread executes a single instruction at a time; with SSE that single instruction may apply to 4 or 8 sets of arguments at a time. So with multiple threads you can also run multiple SSE instructions to process even more data.
You can use threads with for loops - just don't create them inside the inner loops. Instead, take the outer for(i=0; i<z_img.width(); i++) loop and split it into 4 bands of width/4: thread 0 gets 0..width/4, thread 1 gets width/4..width/2, etc.
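A rough sketch of that band split, under the assumption that the body of the outer loop has been wrapped in a helper (processColumns is a hypothetical name here) that only touches the columns [begin, end) of z_img:
#include <thread>
#include <vector>

void processColumns(int begin, int end);   // hypothetical: runs the original i-loop over [begin, end)

void processImage(int width)
{
    const int nThreads = 4;
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; t++) {
        int begin = width * t / nThreads;            // band boundaries: 0..w/4, w/4..w/2, ...
        int end = width * (t + 1) / nThreads;
        workers.emplace_back(processColumns, begin, end);
    }
    for (auto &w : workers)
        w.join();                                    // wait for all bands to finish
}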
On an unrelated note your code also suffers from mixing floating-point and integer math. 0.5 * x is not nearly as efficient as x/2.
Using OpenMP, you could try adding the #pragma to the outer-most for loop. This should solve your problem.
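A minimal sketch of that pragma applied to the question's loop, assuming the per-pixel temporaries (arr, color) are declared inside the parallel loop so each thread has its own copies, and that z_img.draw_point can safely be called concurrently for distinct pixels; compile with -fopenmp (GCC/Clang) or /openmp (MSVC):
#pragma omp parallel for
for (int i = 0; i < z_img.width(); i++) {
    unsigned char arr[4][4];    // per-thread temporaries
    unsigned char color[3];
    for (int j = 0; j < z_img.height(); j++) {
        for (int c = 0; c < 3; c++) {
            for (int l = 0; l < 4; l++)
                for (int k = 0; k < 4; k++)
                    arr[l][k] = img(i / zFactor + l, j / zFactor + k, 0, c);
            color[c] = bicubicInterpolate_paralelo(arr, (unsigned char)(i % zFactor) / zFactor, (unsigned char)(j % zFactor) / zFactor);
        }
        z_img.draw_point(i, j, color);
    }
}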
Going the SSE route is trickier because of the extra alignment restrictions on data, but the easiest transform would be to extend cubicInterpolate_paralelo to handle multiple calculations at once. With enough luck, telling the compiler to use SSE will do the trick for you, but to make sure, you could use intrinsic functions and types.

SSE SIMD Segmentation Fault when using resulting float

I'm trying to use Intel intrinsics to perform an operation quickly on a float array. The operations themselves seem to work fine; however, when I try to get the result of the operation into a standard C variable I get a SEGFAULT. If I comment out the indicated line below, the program runs. If I save the result of the indicated line but do not manipulate it in any way, the program also runs fine. It is only when I try to interact (in any way) with the result of _mm_cvtss_f32(C) that my program crashes. Any ideas?
#include <immintrin.h>

float proc(float *a, float *b, int n, int c, int width) {
    // Operation: SUM of (A - B)^2
    __m128 A, B, C;
    float total = 0;
    for (int d = 0, k = 0; k < c; d += width, k++) {
        for (int i = 0; i < n / 4 * 4; i += 4) {
            A = _mm_load_ps(&a[i + d]);
            B = _mm_load_ps(&b[i + d]);
            C = _mm_sub_ps(A, B);
            C = _mm_mul_ps(C, C);
            C = _mm_hadd_ps(C, C);
            C = _mm_hadd_ps(C, C);
            total += _mm_cvtss_f32(C); // SEGFAULT HERE
        }
        for (int i = n / 4 * 4; i < n; i++) {
            float diff = a[i + d] - b[i + d]; // float, not int, to avoid truncation
            total += diff * diff;
        }
    }
    return total;
}
Are you sure your program actually crashes at the instruction you cited, or is the compiler just optimizing the rest of the loop away when you remove the _mm_cvtss_f32() line (it has no other visible side effects)? A likely cause of failure is improper alignment of the a and b arrays, since you are using aligned load instructions. Are you sure they are 16-byte aligned? On contemporary Intel hardware there is very little performance difference between 16-byte aligned and unaligned loads (see the comments on the question above for a discussion of the issue).
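If it does turn out to be an alignment fault, two common fixes are sketched below; the helper names are made up for illustration, and either switching to unaligned loads or aligning the allocations should be enough on its own.
#include <immintrin.h>
#include <stddef.h>

/* Option 1: inside proc(), replace the aligned loads with unaligned ones:
       A = _mm_loadu_ps(&a[i + d]);
       B = _mm_loadu_ps(&b[i + d]);

   Option 2: keep _mm_load_ps but guarantee 16-byte alignment when allocating. */
float *alloc_floats(size_t count)
{
    return (float *)_mm_malloc(count * sizeof(float), 16);  /* 16-byte aligned */
}

void free_floats(float *p)
{
    _mm_free(p);  /* memory from _mm_malloc must be released with _mm_free */
}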
I mentioned in my original comment that movaps has a shorter encoding than movups. This is not correct. I was thinking instead of movaps versus movapd, which do the same memory transfer, only they're labeled as being for single-precision and double-precision data, respectively. In practice, they do the same thing, but movaps has a shorter encoding.

How to optimize C code: looking for the next set bit and finding the sum of corresponding array elements

EDIT: Now I realize I didn't explain my algorithm well enough. I'll try again.
What I'm doing is something very similar to a dot product of two vectors, but with a difference. I've got two vectors of the same length: one vector of bits and one vector of floats. So I need to calculate the sum
float[0]*bit[0] + float[1]*bit[1] + ... + float[N-1]*bit[N-1], BUT the difference from a classic dot product is that I need to skip a fixed number of elements after each set bit.
Example:
vector of floats = {1.5, 2.0, 3.0, 4.5, 1.0}
vector of bits = {1, 0, 1, 0, 1 }
nSkip = 2
in this case sum is calculated as follows:
sum = floats[0]*bits[0]
bits[0] == 1, so skipping 2 elements (at positions 1 and 2)
sum = sum + floats[3]*bits[3]
bits[3] == 0, so no skipping
sum = sum + floats[4]*bits[4]
result = 1.5*1+4.5*0+1.0*1 = 2.5
The following code is called many times with different data, so I need to optimize it to run as fast as possible on my Core i7 (I don't care much about compatibility with anything else). It is optimized to some extent, but it is still slow and I don't know how to improve it further.
The bit array is implemented as an array of 64-bit unsigned ints, which allows me to use _BitScanForward64 to find the next set bit.
code:
unsigned int i = 0;
float fSum = 0;
do
{
    unsigned int nAddr = i / 64;
    unsigned int nShift = i & 63;
    unsigned __int64 v = bitarray[nAddr] >> nShift;
    unsigned long idx;
    if (!_BitScanForward64(&idx, v))
    {
        i += 64 - nShift;
        continue;
    }
    i += idx;
    fSum += floatarray[i];
    i += nSkip;
} while (i < nEnd);
The profiler shows the 3 slowest hotspots:
1. v = bitarray[nAddr] >> nShift (memory access with shift)
2. _BitScanForward64(&idx, v)
3. fSum += floatarray[i]; (memory access)
But probably there is a different way of doing this. I was thinking about just resetting nSkip bits after each set bit in the bit vector and then calculating a classical dot product - I haven't tried it yet, but honestly I don't believe it will be faster with the extra memory accesses.
You have too many of your operations inside the loop. You also have only one loop, so many of the operations that only need to happen once for each flag word (the 64-bit unsigned integer) are happening 63 extra times.
Consider division an expensive operation and try not to do it too often when optimizing code for performance.
Memory access is also considered expensive in terms of how long it takes, so it should also be limited to required accesses only.
Tests that allow you to exit early are often useful (though sometimes the test itself is expensive relative to the operations you'd be avoiding, but that's probably not the case here).
Using nested loops should simplify this a lot. The outer loop should work at the 64 bit word level, and the inner loop should work at the bit level.
I have noticed a mistake in my earlier recommendations. Since the division here is by 64, which is a power of 2, this is not actually an expensive operation, but we still need to get as many operations as far out of the loops as we can.
/* This is completely untested, but it incorporates the optimizations
   that I outlined as well as a few others.

   I process the arrays backwards, which eliminates comparisons of
   variables against other variables; those are much slower than
   comparisons against 0, which are essentially free on many processors
   when you have just operated on or loaded the value into a register.

   Going backwards at the bit level also allows the compiler to take
   advantage of the fact that testing the top bit is the same as a test
   for negative, which is cheap and mostly free for all but the first
   time through the inner loop (for each time through the outer loop).

   Note: as written this assumes nEnd is a multiple of 64, so that the
   top word is full, and it ignores nSkip. */
double acc = 0.0;
unsigned i_bit;
int i_word;

if (nEnd == 0)
{
    return acc;
}
i_bit = (nEnd - 1) % 64;
i_word = (nEnd - 1) / 64;
do
{
    unsigned __int64 v = bitarray[i_word];
    unsigned i_upper = i_word * 64;
    while (v)
    {
        if (v & 0x8000000000000000ULL)  /* test the top bit */
        {
            /* The following is semantically the same as
               unsigned i = i_bit + (i_word * 64); */
            unsigned i = i_bit | i_upper;
            acc += floatarray[i];
        }
        v <<= 1;
        i_bit--;
    }
    i_bit = 63;
    i_word--;
} while (i_word >= 0);
I think you should check "how to ask questions" first. You will not gain many upvotes for this, since you are asking us to do the work for you instead of introducing a particular problem.
I cannot see why you are incrementing the same variable (i) in two places instead of one.
Also, I think you should declare variables only once, not in every iteration.

Why is addition using bitwise operators in this code much slower than arithmetic addition?

I tried comparing arithmetic addition with a function using bitwise operations that I wrote, only to find that the latter was almost 10x slower. What could be the reason for such a disparity in speed? Since I am adding to the same number in the loop, does the compiler rewrite it to something more optimal in the first case?
Using arithmetic operation:
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t begin = clock();
    int x;
    int i = 1000000000;
    while (i--) {
        x = 1147483000 + i;
    }
    printf("%d\n", x);
    clock_t end = clock();
    double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("time spent = %f\n", time_spent);
    return 0;
}
Output:
1147483000
time spent = 3.520000
Using bitwise operators:
The line inside while loop was replaced with:
x = add(1147483000, i);
and here's the add function:
int add(int x, int y) {
    while (y != 0) {
        int carry = (x & y);
        x = x ^ y;
        y = carry << 1;
    }
    return x;
}
Output:
1147483000
time spent = 32.940000
Integer arithmetic is performed in hardware typically in a very small number of clock cycles.
You will not be able to get close to this performance in software. Your implementation using bitwise operations involves a function call and a loop, and the bitwise operations that you perform typically cost a similar number of clock cycles to arithmetic.
You are performing three bitwise operations per iteration. Frankly, I'm astonished that there is only a factor of 10 here.
I also wonder what your compiler settings are, specifically whether any optimizations are enabled. A good compiler could eliminate your while loop entirely in the arithmetic version. For performance comparisons you should be comparing optimized code, and it looks as if you might not be doing so.
It's difficult to know what you are trying to achieve here, but do not expect to beat the performance of hardware arithmetic units.
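As a rough illustration of that last point, one way to keep an optimized build (-O2) from deleting either loop is to make the stored result observable, for example with volatile; this is only a sketch of a methodology, not a definitive benchmark, and volatile itself adds a store per iteration:
#include <stdio.h>

int add(int x, int y);           /* the bitwise version from the question */

int main(void)
{
    volatile int x;              /* volatile forces the store each iteration, so the loop survives -O2 */
    int i = 1000000000;
    while (i--) {
        x = 1147483000 + i;      /* or: x = add(1147483000, i); */
    }
    printf("%d\n", x);           /* using the result keeps the work alive */
    return 0;
}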
You have replaced this:
x = 1147483000 + i;
with this:
while(y != 0) {
int carry = (x & y);
x = x ^ y;
y = carry << 1;
}
Of course you get a huge slow-down! + of two integers is one assembly instruction. Your while loop executes many instructions, effectively simulating in software what the hardware does when it executes an addition.
To elaborate, this is what a full adder looks like. For 32-bit addition, the ALU contains 32 of these units cascaded. Each of those hardware elements has a very small delay, and the delay of the wires is negligible, so when the hardware adds two 32-bit numbers it takes very little time.
On the other hand, if you try to simulate the addition by hand, you make the CPU fetch and decode instructions and go around the loop up to 32 times, which takes considerably longer.
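To make that concrete, here is a hand trace of the question's add() on a tiny input; the loop runs once per step of carry propagation:
/* add(1, 3):
   iteration 1: carry = 1 & 3 = 1;  x = 1 ^ 3 = 2;  y = 1 << 1 = 2
   iteration 2: carry = 2 & 2 = 2;  x = 2 ^ 2 = 0;  y = 2 << 1 = 4
   iteration 3: carry = 0 & 4 = 0;  x = 0 ^ 4 = 4;  y = 0 << 1 = 0   -> loop exits, returns 4
   Three trips through the loop, each several instructions, versus a single ADD in hardware. */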
When you replace addition with a function call:
calling the function is more time-intensive than a simple addition, because a function call involves stack operations.
In the function you are replacing one addition with three bitwise operations per iteration - how fast they are compared to addition may also be an issue, though I can't confirm that without testing. Can you post the individual times for the three bitwise operations here?
1.
//tic
while(i--) {
int carry = (x & y);
}
//toc
2.
//tic
while(i--) {
x = x ^ y;
}
//toc
3.
//tic
while(i--) {
y = carry << 1;
}
//toc
But the function call should be the main reason.
