OpenMP SIMD (increment vector) - C

I have tried to apply #pragma omp simd to the following code (loops) but it does not seem to work (no speed improvement). I also tried #pragma omp simd linear but all my attempts resulted in a seg fault.
https://github.com/Rdatatable/data.table/blob/master/src/fsort.c#L209
https://github.com/Rdatatable/data.table/blob/master/src/fsort.c#L184
Is it even possible to increment a vector with simd? Example:
#include <stdio.h>
#include <stdlib.h>

int main() {
    int len = 1000;
    int tmp[len];
    for (int i = 0; i < len; ++i) {
        tmp[i] = rand() % 100;
    }
    int *thisCounts = (int *) calloc(len, sizeof(int));
    for (int j = 0; j < len; ++j) {
        thisCounts[tmp[j]]++;
    }
    for (int j = 0; j < len; ++j) {
        printf("%d, ", thisCounts[j]);
    }
    free(thisCounts);
    return 0;
}
FYI, line 209 is the one that takes the most time and that I am trying to improve.
Thank you

It depends on the target hardware architecture. Many processor architectures do not have SIMD instructions that can perform this kind of indirect access. Mainstream x86-64 processors do have scatter/gather instructions for such computations, but they are not efficiently implemented and thus not significantly faster than non-SIMD instructions. Moreover, using them here is difficult because of possible increment conflicts (if tmp[j1] == tmp[j2] with j1 != j2). The AVX-512 SIMD instruction set contains interesting instructions for that, but it is only available on a few recent processors. The same applies to ARM with SVE/SVE2, which is very new and not yet available on the vast majority of ARM processors.
Thus, to put it shortly, there is only a slight chance that your processor can do this with SIMD instructions, but that does not mean it is impossible on every architecture. Note also that using #pragma omp simd is likely incorrect here because of the possible conflicts. Note also that on many modern processors the speed of this operation likely depends on the input data (random data do not behave like most real-world inputs).
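As an aside, a common scalar mitigation for the conflict problem (not SIMD, just a hedged sketch with illustrative names such as count_values, NBUCKETS and NSUB) is to keep several partial histograms so that repeated values do not serialize on a single counter, and to merge them at the end:

#include <string.h>

/* NBUCKETS and NSUB are illustrative; values in tmp are assumed to be 0..NBUCKETS-1 */
enum { NBUCKETS = 100, NSUB = 4 };

void count_values(const int *tmp, int len, int *thisCounts /* NBUCKETS zero-initialised entries */) {
    int sub[NSUB][NBUCKETS];
    memset(sub, 0, sizeof sub);
    int j = 0;
    for (; j + NSUB <= len; j += NSUB) {
        /* spread consecutive elements over independent histograms so that a run of
           identical values does not create one long chain of dependent increments */
        sub[0][tmp[j    ]]++;
        sub[1][tmp[j + 1]]++;
        sub[2][tmp[j + 2]]++;
        sub[3][tmp[j + 3]]++;
    }
    for (; j < len; j++)
        sub[0][tmp[j]]++;
    for (int b = 0; b < NBUCKETS; b++)          /* merge the partial histograms */
        for (int s = 0; s < NSUB; s++)
            thisCounts[b] += sub[s][b];
}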

Related

SSE intrinsics without compiler optimization

I am new to SSE intrinsics and am trying to optimise my code with them. Here is my program, which counts the array elements that are equal to a given value.
I changed my code to SSE version but the speed almost doesn't change. I am wondering whether I use SSE in a wrong way...
This code is for an assignment where we're not allowed to enable compiler optimization options.
No SSE version:
/* start and end come from the surrounding code in the original question */
int get_freq(const float* matrix, float value) {
    int freq = 0;
    for (ssize_t i = start; i < end; i++) {
        if (fabsf(matrix[i] - value) <= FLT_EPSILON) {
            freq++;
        }
    }
    return freq;
}
SSE version:
#include <immintrin.h>
#include <math.h>
#include <float.h>

#define GETLOAD(n)  __m128 load##n  = _mm_load_ps(&matrix[i + 4 * n])
#define GETEQU(n)   __m128 check##n = _mm_and_ps(_mm_cmpeq_ps(load##n, value), and_value)
#define GETCOUNT(n) count = _mm_add_ps(count, check##n)

int get_freq(const float* matrix, float givenValue, ssize_t g_elements) {
    int freq = 0;
    int i;
    __m128 value     = _mm_set1_ps(givenValue);
    __m128 count     = _mm_setzero_ps();
    __m128 and_value = _mm_set1_ps(0x00000001);
    for (i = 0; i + 15 < g_elements; i += 16) {
        GETLOAD(0);  GETLOAD(1);  GETLOAD(2);  GETLOAD(3);
        GETEQU(0);   GETEQU(1);   GETEQU(2);   GETEQU(3);
        GETCOUNT(0); GETCOUNT(1); GETCOUNT(2); GETCOUNT(3);
    }
    __m128 shuffle_a = _mm_shuffle_ps(count, count, _MM_SHUFFLE(1, 0, 3, 2));
    count = _mm_add_ps(count, shuffle_a);
    __m128 shuffle_b = _mm_shuffle_ps(count, count, _MM_SHUFFLE(2, 3, 0, 1));
    count = _mm_add_ps(count, shuffle_b);
    freq = _mm_cvtss_si32(count);
    for (; i < g_elements; i++) {
        if (fabsf(matrix[i] - givenValue) <= FLT_EPSILON) {
            freq++;
        }
    }
    return freq;
}
If you need to compile with -O0, then do as much as possible in a single statement. In normal code, int a=foo(); bar(a); will compile to the same asm as bar(foo()), but in -O0 code, the second version will probably be faster, because it doesn't store the result to memory and then reload it for the next statement.
-O0 is designed to give the most predictable results for debugging, which is why everything is stored to memory after every statement. This is obviously horrible for performance.
I wrote a big answer a while ago for a different question from someone else with a stupid assignment like yours that required them to optimize for -O0. Some of that may help.
Don't try too hard on this assignment. Probably most of the "tricks" that you figure out that make your code run faster with -O0 will only matter for -O0, but make no difference with optimization enabled.
In real life, code is typically compiled with clang or gcc -O2 at least, and sometimes -O3 -march=haswell or whatever to auto-vectorize. (Once it's debugged and you're ready to optimize.)
Re: your update:
Now it compiles, and the horrible asm from the SSE version can be seen. I put it on godbolt along with a version of the scalar code that actually compiles, too. Intrinsics usually compile very badly with optimization disabled, with the inline functions still having args and return values that result in actual load/store round trips (store-forwarding latency) even with __attribute__((always_inline)). See Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled for example.
The scalar version comes out a lot less bad. Its source does everything in one expression, so temporaries stay in registers. The loop counter is still in memory, though, bottlenecking it to at best one iteration per 6 cycles on Haswell, for example. (See the x86 tag wiki for optimization resources.)
BTW, a vectorized fabsf() is easy, see Fastest way to compute absolute value using SSE. That and an SSE compare for less-than should do the trick to give you the same semantics as your scalar code. (But makes it even harder to get -O0 to not suck).
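For illustration, here is a minimal sketch of that idea with SSE2 intrinsics (get_freq_sse is just an illustrative name, and unaligned loads are used so alignment does not matter):

#include <immintrin.h>
#include <math.h>
#include <float.h>
#include <stddef.h>

int get_freq_sse(const float *matrix, float value, ptrdiff_t n) {
    const __m128 vval    = _mm_set1_ps(value);
    const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF)); /* clears the sign bit */
    const __m128 veps    = _mm_set1_ps(FLT_EPSILON);
    __m128i vcount = _mm_setzero_si128();
    ptrdiff_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(&matrix[i]), vval);
        __m128 mask = _mm_cmple_ps(_mm_and_ps(diff, absmask), veps); /* |x - value| <= eps, per lane */
        vcount = _mm_sub_epi32(vcount, _mm_castps_si128(mask));      /* matching lanes are -1, so subtract */
    }
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, vcount);
    int freq = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)                                               /* scalar tail */
        freq += (fabsf(matrix[i] - value) <= FLT_EPSILON);
    return freq;
}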
You might do better just manually unrolling your scalar version one or two times, because -O0 sucks too much.
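As a rough sketch of that (using the same start/end bounds as the question, and four separate accumulators so that the store/reload chains -O0 generates for each one can overlap a little):

#include <math.h>
#include <float.h>
#include <sys/types.h>

int get_freq_unrolled(const float *matrix, float value, ssize_t start, ssize_t end) {
    int f0 = 0, f1 = 0, f2 = 0, f3 = 0;
    ssize_t i = start;
    for (; i + 4 <= end; i += 4) {
        f0 += (fabsf(matrix[i]     - value) <= FLT_EPSILON);
        f1 += (fabsf(matrix[i + 1] - value) <= FLT_EPSILON);
        f2 += (fabsf(matrix[i + 2] - value) <= FLT_EPSILON);
        f3 += (fabsf(matrix[i + 3] - value) <= FLT_EPSILON);
    }
    int freq = f0 + f1 + f2 + f3;
    for (; i < end; i++)                  /* leftover elements */
        freq += (fabsf(matrix[i] - value) <= FLT_EPSILON);
    return freq;
}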
Some compilers are pretty good at vectorizing. Did you check the generated assembly of the optimized builds of both versions? Isn't the "naive" version actually using SIMD or other optimization techniques?

How to avoid fork-join when calling cblas_sgemm in MKL?

The code is like this:
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group A>);
When the matrices are not very large, the fork-join cost is very obvious, especially when this is run on MIC. Besides, splitting the work by hand causes some problems on MIC, as MKL Performance on Intel Phi shows.
//separate the left and result matrix by hand.
//not a wise solution on MIC
#pragma omp parallel
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group B>);
Is there a technique that would let me use code like this:
#pragma omp parallel
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group A>);
where cblas_sgemm reuses the threads forked outside the for loop, since MKL also uses OpenMP to create its threads?
Sincerely, FatRabb1t.
You could do that by linking the sequential version of MKL, so that cblas_sgemm will not fork multiple threads to calculate the matrix.
On the other hand, you could use an OpenMP parallel for to speed up your code.
#pragma omp parallel for
for (int i = 0; i < loop_count; i++)
    cblas_sgemm(<paras group B>);
This way, you fork and join the threads only once instead of loop_count times.
If you are using Intel compiler icc/icpc, you could link the sequential MKL with the compiler option -mkl=sequential instead of -mkl.
If you are using other compilers such as gcc, you could use MKL link line advisor to help you generate the desired link line options.
https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor
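For illustration, a minimal sketch of the parallel-for approach, assuming MKL is linked sequentially and that each iteration works on its own set of matrices (the sizes, array names, and row-major layout are only placeholders for the elided <paras group B>):

#include <mkl.h>

void batched_sgemm(int loop_count, int m, int n, int k, float **A, float **B, float **C) {
    #pragma omp parallel for
    for (int i = 0; i < loop_count; i++) {
        /* C[i] = 1.0f * A[i] * B[i] + 0.0f * C[i]; each call runs single-threaded
           because MKL was linked with its sequential layer */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k,
                    1.0f, A[i], k,
                          B[i], n,
                    0.0f, C[i], n);
    }
}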

Do virtual cores contribute to performance when parallelizing a matrix multiplication?

I have an O(n^3) matrix multiplication function in C.
void matrixMultiplication(int N, double **A, double **B, double **C, int threadCount) {
    int i = 0, j = 0, k = 0, tid;

    #pragma omp parallel num_threads(4) shared(N, A, B, C, threadCount) private(i, j, k, tid)
    {
        tid = omp_get_thread_num();

        #pragma omp for
        for (i = 1; i < N; i++)
        {
            printf("Thread %d starting row %d\n", tid, i);
            for (j = 0; j < N; j++)
            {
                for (k = 0; k < N; k++)
                {
                    C[i][j] = C[i][j] + A[i][k] * B[k][j];
                }
            }
        }
    }
    return;
}
I am using OpenMP to parallelize this function by splitting up the multiplications. I am performing this computation on square matrices of size N = 3000 with a 1.8 GHz Intel Core i5 processor.
This processor has two physical cores and two virtual cores. I noticed the following performances for my computation
1 thread: 526.06 s
2 threads: 264.531 s
3 threads: 285.195 s
4 threads: 279.914 s
I had expected my gains to continue until the number of threads was set to four. However, this obviously did not occur.
Why did this happen? Is it because the performance of a core is equal to the sum of its physical and virtual cores?
Using more than one hardware thread per core can help or hurt, depending on circumstances.
It can help if one hardware thread stalls because of a cache miss, and the other hardware thread can keep going and keep the ALU busy.
It can hurt if each hardware thread forces evictions of data needed by the other thread. That is, the threads destructively interfere with each other.
One way to address the problem is to write the kernel in a way such that each thread needs only half the cache. For example, blocked matrix multiplication can be used to minimize the cache footprint of a matrix multiplication.
Another way is to write the algorithm in a way such that both threads operate on the same data at the same time, so they help each other bring data into cache (constructive interference). This approach is admittedly hard to do with OpenMP unless the implementation has good support for nested parallelism.
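As a rough illustration of the first approach, here is a minimal blocked (tiled) sketch; BS is an illustrative tile size that would need tuning so that the working tiles fit in cache:

#define BS 64

void matmulBlocked(int N, double **A, double **B, double **C) {
    #pragma omp parallel for
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply the (ii,kk) tile of A by the (kk,jj) tile of B into C */
                for (int i = ii; i < ii + BS && i < N; i++)
                    for (int k = kk; k < kk + BS && k < N; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + BS && j < N; j++)
                            C[i][j] += aik * B[k][j];
                    }
}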
I guess that the bottleneck is the memory (or L3 CPU cache) bandwidth. Arithmetic is quite cheap these days.
If you can afford it, try to benchmark the same code with the same data on some more powerful processor (e.g. some socket 2013 i7)
Remember that on today's processors, a cache miss lasts as long as several hundred instructions (or cycles): RAM is very slow w.r.t. cache or CPU.
BTW, if you have a GPGPU you could play with OpenCL.
Also, it is probable that linear algebra packages like LAPACK (or some other numerical libraries) are more efficient than your naive matrix multiplication.
You could also consider using __builtin_prefetch (see this)
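A rough illustration of the prefetch idea on one inner product (the look-ahead distance of 8 rows is only a guess and would need measuring):

double dotRowCol(int N, double **A, double **B, int i, int j) {
    double sum = 0.0;
    for (int k = 0; k < N; k++) {
        if (k + 8 < N)
            __builtin_prefetch(&B[k + 8][j], 0, 1); /* hint: read-only, low temporal locality */
        sum += A[i][k] * B[k][j];
    }
    return sum;
}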
BTW, numerical computation is hard. I am not expert at all, but I met people who worked dozens of years in it (often after a PhD in the field).

OMP Optimizing nested loop with if statement

I have the following few lines of code that I am trying to run in parallel
void optimized(int data_len, unsigned int * input_array, unsigned int * output_array, unsigned int * filter_list, int filter_len) {
    #pragma omp parallel for
    for (int j = 0; j < filter_len; j++) {
        for (int i = 0; i < data_len; i++) {
            if (input_array[i] == filter_list[j]) {
                output_array[i] = filter_list[j];
            }
        }
    }
}
Just putting the pragma statement has really done wonders, but I am trying to further reduce the run time of this code. I have tried many things ranging from array padding to collapsing the loops to creating tasks, but the only thing that has seemed to work thus far is loop unrolling. Does anyone have any suggestions on what I could possibly do to further speed up this code?
You are doing pure memory accesses, so you are limited by the memory bandwidth of the machine.
Multi-threading is not going to help you much. gcc -O2 already gives you SSE instruction optimization, so using Intel intrinsics directly may not help either. You may try to check 4 ints at once, because SSE supports 128-bit registers (please see https://gcc.gnu.org/onlinedocs/gcc-4.4.5/gcc/X86-Built_002din-Functions.html and google for some examples). Reducing the amount of data also helps, e.g. by using short instead of int if you can.
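A hedged sketch of the "4 ints at once" idea with SSE2 intrinsics (same semantics as the original loops; filter_sse is just an illustrative name):

#include <emmintrin.h>

void filter_sse(int data_len, unsigned int *input_array, unsigned int *output_array,
                unsigned int *filter_list, int filter_len) {
    for (int j = 0; j < filter_len; j++) {
        __m128i vfilt = _mm_set1_epi32((int)filter_list[j]);
        int i = 0;
        for (; i + 4 <= data_len; i += 4) {
            __m128i vin   = _mm_loadu_si128((__m128i *)&input_array[i]);
            __m128i vout  = _mm_loadu_si128((__m128i *)&output_array[i]);
            __m128i match = _mm_cmpeq_epi32(vin, vfilt);              /* all-ones where equal */
            /* keep the old output where there is no match, write the filter value where there is */
            __m128i res   = _mm_or_si128(_mm_and_si128(match, vfilt),
                                         _mm_andnot_si128(match, vout));
            _mm_storeu_si128((__m128i *)&output_array[i], res);
        }
        for (; i < data_len; i++)                                     /* scalar tail */
            if (input_array[i] == filter_list[j])
                output_array[i] = filter_list[j];
    }
}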

OpenMP with 1 thread slower than sequential version

I have implemented knapsack using OpenMP (gcc version 4.6.3)
#define MAX(x,y) ((x)>(y) ? (x) : (y))
#define table(i,j) table[(i)*(C+1)+(j)]

for (i = 1; i <= N; ++i) {
    #pragma omp parallel for
    for (j = 1; j <= C; ++j) {
        if (weights[i] > j) {
            table(i,j) = table(i-1,j);
        } else {
            table(i,j) = MAX(profits[i]+table(i-1,j-weights[i]), table(i-1,j));
        }
    }
}
Execution time for the sequential program = 1 s
Execution time for OpenMP with 1 thread = 1.7 s (overhead = 40%)
I used the same compiler optimization flags (-O3) in both cases.
Can someone explain the reason behind this behavior?
Thanks.
Enabling OpenMP inhibits certain compiler optimisations, e.g. it could prevent loops from being vectorised or shared variables from being kept in registers. Therefore OpenMP-enabled code is usually slower than the serial and one has to utilise the available parallelism to offset this.
That being said, your code contains a parallel region nested inside the outer loop. This means that the overhead of entering and exiting the parallel region is paid N times. This only makes sense if N is relatively small and C is significantly larger (like orders of magnitude larger) than N, so that the work done inside the region greatly outweighs the OpenMP overhead.
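For illustration, a minimal sketch of hoisting the parallel region out of the outer loop so the thread team is created only once; the implicit barrier at the end of each omp for preserves the dependence of row i on row i-1 (the knapsack wrapper function and its parameters are only illustrative):

#define MAX(x,y) ((x)>(y) ? (x) : (y))
#define table(i,j) table[(i)*(C+1)+(j)]

void knapsack(int N, int C, const int *weights, const int *profits, int *table) {
    #pragma omp parallel
    {
        /* every thread runs the outer loop; the inner omp for splits each row's work */
        for (int i = 1; i <= N; ++i) {
            #pragma omp for
            for (int j = 1; j <= C; ++j) {
                if (weights[i] > j)
                    table(i,j) = table(i-1,j);
                else
                    table(i,j) = MAX(profits[i] + table(i-1,j-weights[i]), table(i-1,j));
            }
            /* implicit barrier here: row i is complete before any thread starts row i+1 */
        }
    }
}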
