I just figured out how to compute the dot product of two arrays, as in the following code:
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float result = 0;
for (int i = 0; i < 8; i ++) {
result += A[i] * B[i];
}
is equivalent to (in SIMD):
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float result = 0;
__m128 r1 = {0,0,0,0};
__m128 r2 = {0,0,0,0};
__m128 r3 = {0,0,0,0};
for (int i = 0; i < 8; i += 4) {
float C[4] = {A[i], A[i+1], A[i+2], A[i+3]};
float D[4] = {B[i], B[i+1], B[i+2], B[i+3]};
__m128 a = _mm_loadu_ps(C);
__m128 b = _mm_loadu_ps(D);
r1 = _mm_mul_ps(a,b);
r2 = _mm_hadd_ps(r1, r1);
r3 = _mm_add_ss(_mm_hadd_ps(r2, r2), r3);
_mm_store_ss(&result, r3);
}
I am now curious how to write the SIMD equivalent when the elements I want to multiply aren't consecutive in the array. For example, if I wanted to perform the following, what would be the equivalent in SIMD?
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float result = 0;
for (int i = 0; i < 8; i++) {
for (int j = 0; j < 8; j++) {
result += A[foo(i)] * B[foo(j)];
}
}
foo is just some function that returns an int as a function of its input argument (i.e., it maps loop indices to array indices).
If I had to do this task, I would do it as follows:
int A[8] = {1,2,3,4,5,1,2,3};
int B[8] = {2,3,4,5,6,2,3,4};
float PA[8], PB[8];
for (int i = 0; i < 8; i++)
{
PA[i] = A[foo(i)];
PB[i] = B[foo(i)];
}
__m128 sums = _mm_set1_ps(0);
for (int i = 0; i < 8; i++)
{
__m128 a = _mm_set1_ps(PA[i]);
for (int j = 0; j < 8; j += 4)
{
__m128 b = _mm_loadu_ps(PB + j);
sums = _mm_add_ps(sums, _mm_mul_ps(a, b));
}
}
float results[4];
_mm_storeu_ps(results, sums);
float result = results[0] + results[1] + results[2] + results[3];
Generally speaking, SIMD does not like things like random access to individual elements. However, there are still several tricks that can be used.
If the indices produced by foo are known at compile time, you can probably shuffle both vectors to align their elements properly. Just look at the intrinsics in the swizzle category of the Intel Intrinsics Guide. You'll almost certainly need something like _mm_shuffle_ps and the _mm_unpackXX_ps family. Various shift/align instructions may also be useful.
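For instance, here is a minimal sketch of the compile-time case, assuming (hypothetically) that foo reverses each group of four indices and that FA is a float copy of A (an illustrative name):
// foo maps {0,1,2,3} -> {3,2,1,0}: load four consecutive floats
// and reverse them in-register instead of gathering one by one.
__m128 a = _mm_loadu_ps(FA);                              // {FA[0], FA[1], FA[2], FA[3]}
__m128 g = _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3)); // {FA[3], FA[2], FA[1], FA[0]}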
With AVX2, you can try the gather instructions. For float elements with 32-bit indices you can use the _mm_i32gather_ps or _mm256_i32gather_ps intrinsics. However, @PaulR writes here that they are no faster than trivial scalar loads.
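A minimal sketch of the gather approach, assuming AVX2 and that FB is a float copy of B (again a hypothetical name):
// Fetch FB[foo(0)]..FB[foo(3)] with a single gather instruction.
__m128i idx = _mm_setr_epi32(foo(0), foo(1), foo(2), foo(3));
__m128  b   = _mm_i32gather_ps(FB, idx, 4); // scale = 4 bytes per float element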
Another solution may be possible with the _mm_shuffle_epi8 intrinsic from SSSE3. It is a great instruction that allows you to perform an in-register gather with the granularity of individual bytes. However, creating the shuffle mask is not a simple task. This paper (read sections 3.1 and 4) shows how to extend this approach to input arrays larger than one XMM register, but it seems that for 64 or more elements it is no longer better than scalar code.
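For the single-register case with a compile-time mask, the idea looks like this (a sketch; src and the mask values are purely illustrative):
// Lane i of the result receives byte mask[i] of src; a set high bit
// in a mask byte would zero that lane instead.
__m128i mask = _mm_setr_epi8(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3);
__m128i gathered = _mm_shuffle_epi8(src, mask);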
Related
Is there a fast way to count the number of unique elements in a SIMD vector (AVX and any SSE) without converting it to an array? I want to use it in a specific brute-forcer as an optimization, so ideally it should be as fast as possible.
Currently I am doing:
// count the number of unique elements
int uniqueCount(v16n a) {
alignas(16) unsigned char v[16];
_mm_store_si128((v16n*)v, a);
int count = 1;
for(int i = 1; i < 16; i++) {
int j;
for(j = 0; j < i; j++)
if(v[i] == v[j])
break;
if(i == j) count++;
}
return count;
}
Here's one possible implementation. The code requires SSSE3 and SSE 4.1, and benefits slightly from AVX2 when available.
// Count unique bytes in the vector
size_t countUniqueBytes( __m128i vec )
{
size_t result = 1;
// Accumulator for the bytes encountered so far, initialize with broadcasted first byte
#ifdef __AVX2__
__m128i partial = _mm_broadcastb_epi8( vec );
#else
__m128i partial = _mm_shuffle_epi8( vec, _mm_setzero_si128() );
#endif
// Permutation vector to broadcast these bytes
const __m128i one = _mm_set1_epi8( 1 );
__m128i perm = one;
// If you use GCC, uncomment the following line and benchmark; it may or may not help:
// #pragma GCC unroll 1
for( int i = 1; i < 16; i++ )
{
// Broadcast i-th byte from the source vector
__m128i bc = _mm_shuffle_epi8( vec, perm );
perm = _mm_add_epi8( perm, one );
// Compare bytes with the partial vector
__m128i eq = _mm_cmpeq_epi8( bc, partial );
// Append current byte to the partial vector
partial = _mm_alignr_epi8( bc, partial, 1 );
// Increment result if the byte was not yet in the partial vector
// Compilers are smart enough to do that with the `sete` instruction, no branches
int isUnique = _mm_testz_si128( eq, eq );
result += ( isUnique ? (size_t)1 : (size_t)0 );
}
return result;
}
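Usage is straightforward; for example (the expected count here is worked out by hand):
__m128i v = _mm_setr_epi8(1, 2, 3, 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 9, 0);
size_t n = countUniqueBytes(v); // 10 distinct byte values: 0 through 9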
I'm working through CUDA C Programming by Cheng, and came across this piece of code:
void sumMatrixOnHost (float *A, float *B, float *C, const int nx, const int ny) {
float *ia = A;
float *ib = B;
float *ic = C;
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[ix] = ia[ix] + ib[ix];
}
ia += nx; ib += nx; ic += nx;
}
}
This is for matrix addition, with the matrices stored in row-major format.
As I understand it, the inner for loop iterates over a row performing element-wise addition, and the outer for loop then advances the pointers to the start of the next row.
Why is this approach better than a single loop over the whole matrix, i.e.
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
or a double for loop with explicit indexing, i.e.
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
Is this something to do with how it is optimized by the compiler?
The simplest approach is always the best approach:
for (int i=0; i<ny*nx; i++) {
C[i] = A[i] + B[i];
}
This will be faster than the first solution. The problem with splitting the matrix up by row is that the vectoriser will:
process each line in batches of 32 bytes (the size of a YMM register)
process the remaining handful of values at the end of the line
then repeat for each line!
If however you do it with a single loop, the generated code will:
process all the data in batches of 32 bytes (the size of a YMM register)
process the remaining handful of values at the end of the matrix that don't align to 32-byte blocks
The first version just adds pointless code to process the inner loop. None of that code is needed; it just breaks the ability to vectorise the entire matrix.
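To make that concrete, here is roughly what the vectoriser produces for the flat loop, written out with AVX intrinsics (a sketch; the scalar tail is shown only as a comment):
int total = ny * nx;
int i = 0;
// Process 8 floats per 32-byte YMM register.
for (; i + 8 <= total; i += 8) {
    __m256 va = _mm256_loadu_ps(A + i);
    __m256 vb = _mm256_loadu_ps(B + i);
    _mm256_storeu_ps(C + i, _mm256_add_ps(va, vb));
}
// ...then the remaining total % 8 elements are handled with scalar code.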
The approach in sumMatrixOnHost is better for optimization, and it should (generally) execute faster than the two approaches you have suggested.
For the ALU, multiplication takes more time than addition.
So in sumMatrixOnHost there is no multiplication, whereas in
for (int i=0; i<ny*nx; i++) {
ic[i] = ia[i] + ib[i];
}
there is a multiplication (ny*nx in the loop condition) in each iteration of the loop.
in
for (int iy=0; iy<ny; iy++) {
for (int ix=0; ix<nx; ix++) {
ic[iy*nx+ix] = ia[iy*nx+ix] + ib[iy*nx+ix];
}
}
there are three multiplications in each iteration of the loop.
A simpler approach can be
int n = ny*nx;
for (int i=0; i<n; i++) {
ic[i] = ia[i] + ib[i];
}
but with this last approach we lose another thing that is good about sumMatrixOnHost: the ability to perform the operation on matrix blocks rather than on the whole matrix (see the sketch below).
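To illustrate that last point, here is a sketch of the block idea (sumBlock, x0, y0, bw, bh are illustrative names): with the row-pointer form you can hand any sub-rectangle of the matrix to the same loop just by offsetting the base pointers.
void sumBlock(float *A, float *B, float *C, int nx,
              int x0, int y0, int bw, int bh) {
    // point at the top-left corner of the block
    float *ia = A + y0 * nx + x0;
    float *ib = B + y0 * nx + x0;
    float *ic = C + y0 * nx + x0;
    for (int iy = 0; iy < bh; iy++) {
        for (int ix = 0; ix < bw; ix++)
            ic[ix] = ia[ix] + ib[ix];
        ia += nx; ib += nx; ic += nx; // advance to the next row of the block
    }
}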
I am trying to accelerate a stereo matching algorithm on the ODROID XU4 ARM platform using NEON SIMD. For this purpose I am using OpenMP's pragmas.
void StereoMatch:: sadCol(uint8_t* leftRank,uint8_t* rightRank,const int SAD_WIDTH,const int SAD_WIDTH_STEP, const int imgWidth,int j, int d , uint16_t* cost)
{
uint16_t sum = 0;
int n = 0;
int m =0;
for ( n = 0; n < SAD_WIDTH+1; n++)
{
#pragma omp simd
for( m = 0; m< SAD_WIDTH_STEP; m = m + imgWidth )
{
sum += abs(leftRank[j+m+n]-rightRank[j+m+n-d]);
}
cost[n] = sum;
sum = 0;
}
}
I am fairly new to SIMD and OpenMP. I understand that the simd pragma directs the compiler to vectorize the subtraction, but when I executed the code I noticed no difference. What should I add to my code in order to vectorize it?
As said in the comments, ARM NEON has an instruction which does directly what you want, i.e., computes the absolute difference of unsigned bytes and accumulates it into unsigned 16-bit integers.
Assuming SAD_WIDTH+1 == 8, here is a very simple implementation using intrinsics (based on the simplified version by @nemequ):
void sadCol(uint8_t* leftRank,
uint8_t* rightRank,
int j,
int d ,
uint16_t* cost) {
const int SAD_WIDTH = 7;
const int imgWidth = 320;
const int SAD_WIDTH_STEP = SAD_WIDTH * imgWidth;
uint16x8_t cost_8 = {0};
for(int m = 0; m < SAD_WIDTH_STEP; m = m + imgWidth ) {
cost_8 = vabal_u8(cost_8, vld1_u8(&leftRank[j+m]), vld1_u8(&rightRank[j+m-d]));
}
vst1q_u16(cost, cost_8);
}
vld1_u8 loads 8 consecutive bytes, vabal_u8 computes the absolute difference and accumulates it to the first register. Finally, vst1q_u16 stores the register to memory.
You can easily make imgWidth and SAD_WIDTH_STEP function parameters. If SAD_WIDTH+1 is a different multiple of 8, you can write another loop for that.
I have no ARM platform at hand to test it, but "it compiles": https://godbolt.org/z/vPqiYI (and the assembly looks fine, in my eyes). If you optimize with -O3 gcc will unroll the loop.
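For example, if SAD_WIDTH+1 == 16, a sketch with two accumulators might look like this (untested, like the above):
uint16x8_t cost_lo = vdupq_n_u16(0), cost_hi = vdupq_n_u16(0);
for (int m = 0; m < SAD_WIDTH_STEP; m += imgWidth) {
    uint8x16_t left  = vld1q_u8(&leftRank[j+m]);
    uint8x16_t right = vld1q_u8(&rightRank[j+m-d]);
    // accumulate |left - right| into two vectors of eight uint16 each
    cost_lo = vabal_u8(cost_lo, vget_low_u8(left),  vget_low_u8(right));
    cost_hi = vabal_u8(cost_hi, vget_high_u8(left), vget_high_u8(right));
}
vst1q_u16(cost,     cost_lo);
vst1q_u16(cost + 8, cost_hi);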
!!! HOMEWORK - ASSIGNMENT !!!
Please do not post code, as I would like to complete it myself; rather, if possible, point me in the right direction with general information, by pointing out mistakes in my thinking, or with other useful and relevant resources.
I have a method that creates my square npages * npages matrix hat of doubles for use in my PageRank algorithm.
I have implemented it with pthreads, with SIMD, and with both pthreads and SIMD. I have used Xcode Instruments' time profiler and found that the pthreads-only version is the fastest, next is the SIMD-only version, and slowest is the version with both SIMD and pthreads.
As it is homework, it can be run on multiple different machines; however, we were given the header #include <immintrin.h>, so it is to be assumed we can use up to AVX at least. We are given the number of threads the program will use as an argument to the program, and we store it in a global variable g_nthreads.
In my tests I have been running it on my machine, which is an Ivy Bridge with 4 hardware cores and 8 logical cores, and I have been testing it with 4 threads and with 8 threads as the argument.
RUNNING TIMES:
SIMD ONLY:
331ms - for the construct_matrix_hat function
PTHREADS ONLY (8 threads):
70ms - each thread concurrently
SIMD & PTHREADS (8 threads):
110ms - each thread concurrently
What am I doing that is slowing it down more when using both forms of optimisation?
I will post each implementation:
All versions share these macros:
#define BIG_CHUNK (g_n2/g_nthreads)
#define SMALL_CHUNK (g_npages/g_nthreads)
#define MOD BIG_CHUNK - (BIG_CHUNK % 4)
#define IDX(a, b) ((a * g_npages) + b)
Pthreads:
// struct used for passing arguments
typedef struct {
double* restrict m;
double* restrict m_hat;
int t_id;
char padding[44];
} t_arg_matrix_hat;
// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
// set coordinate limits thread is able to act upon
size_t start = t_arg->t_id * BIG_CHUNK;
size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}
return NULL;
}
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
// create structs to send and retrieve matrix and value from threads
t_arg_matrix_hat t_args[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
t_args[i] = (t_arg_matrix_hat) {
.m = matrix,
.m_hat = matrix_hat,
.t_id = i
};
}
// create threads and send structs with matrix and value to divide the matrix and
// initialise the coordinates with the given value
pthread_t threads[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
}
// join threads after all coordinates have been initialised
for (size_t i = 0; i < g_nthreads; i++) {
pthread_join(threads[i], NULL);
}
return matrix_hat;
}
SIMD:
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener};
__m256d b = _mm256_loadu_pd(dampeners);
// Use simd to subtract values from each other
for (size_t i = 0; i < g_mod; i += 4) {
__m256d a = _mm256_loadu_pd(matrix + i);
__m256d res = _mm256_mul_pd(a, b);
_mm256_storeu_pd(&matrix_hat[i], res);
}
// Subtract values from each other that weren't included in simd
for (size_t i = g_mod; i < g_n2; i++) {
matrix_hat[i] = g_dampener * matrix[i];
}
double hats[4] = {HAT, HAT, HAT, HAT};
b = _mm256_loadu_pd(hats);
// Use simd to raise each value to the power 2
for (size_t i = 0; i < g_mod; i += 4) {
__m256d a = _mm256_loadu_pd(matrix_hat + i);
__m256d res = _mm256_add_pd(a, b);
_mm256_storeu_pd(&matrix_hat[i], res);
}
// Raise each value to the power 2 that wasn't included in simd
for (size_t i = g_mod; i < g_n2; i++) {
matrix_hat[i] += HAT;
}
return matrix_hat;
}
Pthreads & SIMD:
// struct used for passing arguments
typedef struct {
double* restrict m;
double* restrict m_hat;
int t_id;
char padding[44];
} t_arg_matrix_hat;
// Construct matrix_hat with pthreads
static void* pthread_construct_matrix_hat(void* arg) {
t_arg_matrix_hat* t_arg = (t_arg_matrix_hat*) arg;
// set coordinate limits thread is able to act upon
size_t start = t_arg->t_id * BIG_CHUNK;
size_t end = t_arg->t_id + 1 != g_nthreads ? (t_arg->t_id + 1) * BIG_CHUNK : g_n2;
size_t leftovers = start + MOD;
__m256d b1 = _mm256_loadu_pd(dampeners);
//
for (size_t i = start; i < leftovers; i += 4) {
__m256d a1 = _mm256_loadu_pd(t_arg->m + i);
__m256d r1 = _mm256_mul_pd(a1, b1);
_mm256_storeu_pd(&t_arg->m_hat[i], r1);
}
//
for (size_t i = leftovers; i < end; i++) {
t_arg->m_hat[i] = dampeners[0] * t_arg->m[i];
}
__m256d b2 = _mm256_loadu_pd(hats);
//
for (size_t i = start; i < leftovers; i += 4) {
__m256d a2 = _mm256_loadu_pd(t_arg->m_hat + i);
__m256d r2 = _mm256_add_pd(a2, b2);
_mm256_storeu_pd(&t_arg->m_hat[i], r2);
}
//
for (size_t i = leftovers; i < end; i++) {
t_arg->m_hat[i] += hats[0];
}
return NULL;
}
// Construct matrix_hat
double* construct_matrix_hat(double* matrix) {
double* matrix_hat = malloc(sizeof(double) * g_n2);
// create structs to send and retrieve matrix and value from threads
t_arg_matrix_hat t_args[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
t_args[i] = (t_arg_matrix_hat) {
.m = matrix,
.m_hat = matrix_hat,
.t_id = i
};
}
// create threads and send structs with matrix and value to divide the matrix and
// initialise the coordinates with the given value
pthread_t threads[g_nthreads];
for (size_t i = 0; i < g_nthreads; i++) {
pthread_create(threads + i, NULL, pthread_construct_matrix_hat, t_args + i);
}
// join threads after all coordinates have been initialised
for (size_t i = 0; i < g_nthreads; i++) {
pthread_join(threads[i], NULL);
}
return matrix_hat;
}
I think it's because your SIMD code is horribly inefficient: it loops over the memory twice instead of doing the add together with the multiply before storing. You didn't test SIMD vs. a scalar baseline, but if you had, you'd probably find that your SIMD code wasn't a speedup with a single thread either.
STOP READING HERE if you want to solve the rest of your homework yourself.
If you used gcc -O3 -march=ivybridge, the simple scalar loop in the pthread version probably auto-vectorized into something like what you should have done with intrinsics. You even used restrict, so it might realize that the pointers can't overlap with each other, or with g_dampener.
// this probably autovectorizes well.
// Initialise coordinates with given uniform value
for (size_t i = start; i < end; i++) {
t_arg->m_hat[i] = ((g_dampener * t_arg->m[i]) + HAT);
}
// but this would be even safer to help the compiler's aliasing analysis:
double dampener = g_dampener; // in case the compiler thinks one of the pointers might point at the global
double *restrict hat = t_arg->m_hat;
const double *restrict mat = t_arg->m;
... same loop, but using these locals instead
It's probably not a problem for an FP loop, since double definitely can't alias with double *.
The coding style is also pretty nasty. You should give meaningful names to your __m256d variables whenever possible.
Also, you use malloc, which doesn't guarantee that matrix_hat will be aligned to a 32B boundary. C11's aligned_alloc is probably the nicest way, vs. posix_memalign (clunky interface), _mm_malloc (have to free with _mm_free, not free(3)), or other options.
double* construct_matrix_hat(const double* matrix) {
// double* matrix_hat = malloc(sizeof(double) * g_n2);
double* matrix_hat = aligned_alloc(64, sizeof(double) * g_n2);
// double dampeners[4] = {g_dampener, g_dampener, g_dampener, g_dampener}; // This idiom is terrible, and might actually compile to code that stores it 4 times on the stack and then loads.
__m256d vdamp = _mm256_set1_pd(g_dampener); // will compile to a broadcast-load (vbroadcastsd)
__m256d vhat = _mm256_set1_pd(HAT);
size_t last_full_vector = g_n2 & ~3ULL; // don't load this from a global.
// it's better for the compiler to see how it's calculated from g_n2
// ??? Use simd to subtract values from each other // huh? this is a multiply, not a subtract. Also, everyone can see it's using SIMD, that part adds no new information
// if you really want to manually vectorize this, instead of using an OpenMP pragma or -O3 on the scalar loop, then:
for (size_t i = 0; i < last_full_vector; i += 4) {
__m256d vmat = _mm256_loadu_pd(matrix + i);
__m256d vmul = _mm256_mul_pd(vmat, vdamp);
__m256d vres = _mm256_add_pd(vmul, vhat);
_mm256_store_pd(&matrix_hat[i], vres); // aligned store. Doesn't matter for performance.
}
#if 0
// Scalar cleanup
for (size_t i = last_full_vector; i < g_n2; i++) {
matrix_hat[i] = g_dampener * matrix[i] + HAT;
}
#else
// assume that g_n2 >= 4, and do a potentially-overlapping unaligned vector
if (last_full_vector != g_n2) {
// Or have this always run, and have the main loop stop one element sooner (so this overlaps by 0..3 instead of by 1..3 with a conditional)
assert(g_n2 >= 4);
__m256d vmat = _mm256_loadu_pd(matrix + g_n2 - 4);
__m256d vmul = _mm256_mul_pd(vmat, vdamp);
__m256d vres = _mm256_add_pd(vmul, vhat);
_mm256_storeu_pd(&matrix_hat[g_n2-4], vres);
}
#endif
return matrix_hat;
}
This version compiles (after defining a couple globals) to the asm we expect. BTW, normal people pass sizes around as function arguments. This is another way of avoiding optimization-failure due to C aliasing rules.
Anyway, really your best bet is to let OpenMP auto-vectorize it, because then you don't have to write a cleanup loop yourself. There's nothing tricky about the data organization, so it vectorizes trivially. (And it's not a reduction, like in your other question, so there's no loop-carried dependency or order-of-operations concern).
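A sketch of that OpenMP approach, reusing the question's own globals (compile with -fopenmp -O3):
// One pragma handles both the threading and the vectorization;
// no manual cleanup loop is needed.
#pragma omp parallel for simd
for (size_t i = 0; i < g_n2; i++) {
    matrix_hat[i] = g_dampener * matrix[i] + HAT;
}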
I have an integer array
int number[] = {1,2,3,4};
What can I do to get int x = 1234?
I need a C version of it.
x = 1000*number[0] + 100*number[1] + 10*number[2] + number[3];
This is basically how decimal numbers work. A more general version (when you don't know how long 'number' is) would be:
int x = 0;
int base = 10;
for(int ii = 0; ii < sizeof(number)/sizeof(number[0]); ii++) x = base*x + number[ii];
Note: if base is something other than 10, the above code will still work. Of course, if you printed out x with the usual cout<<x, you would get a confusing answer, but it might serve you at some other time. Of course, you would really want to check that number[ii] is between 0 and 9, inclusive, but that's pretty much implied by your question. Still, good programming requires checking, checking, and checking. I'm sure you can add that bit yourself, though.
You can think of "shifting" a number to the left by multiplying it by ten, and of appending a digit by adding it after the shift.
So you effectively end up with a loop where you do total *= 10 and then total += number[i].
Of course, this only works if your array contains digits; if it contains characters you'll want to do number[i] - '0' (see the sketch below), and if the number is in a different base you'll want to multiply by a different number (8, for instance, if it is octal).
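A quick sketch of the character case just mentioned:
char digits[] = "1234";
int total = 0;
for (int i = 0; digits[i] != '\0'; i++)
    total = total * 10 + (digits[i] - '0'); // '0'..'9' map to 0..9
// total is now 1234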
int i = 0, x = 0;
for(; i < arrSize; i++)
x = x * 10 + number[i];
x is the result.
int i;
int x = 0;
for ( i = 0; i < 4; i++ )
x = ( 10 * x + number[i] );
int number[] = {1,2,3,4};
int x = 0, temp = 10;
int len = sizeof(number) / sizeof(number[0]);
for (int i = 0; i < len; i++)
{
    x = x * temp + number[i];
}
printf("%d\n", x);
You could do something with a for loop and powers of 10:
int tens = 1;
int final = 0;
for (int i = arrSize - 1; i >= 0; --i)
{
final += tens*number[i];
tens*=10;
}
return final;
The answer is quite easy; here is a complete function:
int toNumber(int number[], int arraySize)
{
    int i;
    int value = 0;
    for (i = 0; i < arraySize; i++)
    {
        value *= 10;
        value += number[i];
    }
    return value;
}
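For example, a quick check of the function above:
int number[] = {1, 2, 3, 4};
int x = toNumber(number, 4); // x == 1234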