I wrote a matrix-vector product program using OpenMP and AVX2.
However, I get the wrong answer because of OpenMP.
The correct result is that every element of array c equals 100.
My result was a mix of 98, 99, and 100.
The actual code is below.
I compiled with Clang using -fopenmp, -mavx, and -mfma.
#include "stdio.h"
#include "math.h"
#include "stdlib.h"
#include "omp.h"
#include "x86intrin.h"
void mv(double *a, double *b, double *c, int m, int n, int l)
{
    int k;
    #pragma omp parallel
    {
        __m256d va, vb, vc;
        int i;
        #pragma omp for private(i, va, vb, vc) schedule(static)
        for (k = 0; k < l; k++) {
            vb = _mm256_broadcast_sd(&b[k]);
            for (i = 0; i < m; i+=4) {
                va = _mm256_loadu_pd(&a[m*k+i]);
                vc = _mm256_loadu_pd(&c[i]);
                vc = _mm256_fmadd_pd(vc, va, vb);
                _mm256_storeu_pd( &c[i], vc );
            }
        }
    }
}
int main(int argc, char* argv[]) {
// set variables
int m;
double* a;
double* b;
double* c;
int i;
// main program
// set vector or matrix
a=(double *)malloc(sizeof(double) * m*m);
b=(double *)malloc(sizeof(double) * m*1);
c=(double *)malloc(sizeof(double) * m*1);
for (i=0;i<m;i++) {
for (i=m;i<m*m;i++) {
mv(a, b, c, m, 1, m);
for (i=0;i<m;i++) {
printf("%e\n", c[i]);
return 0;
I know a critical section would help, but it was slow.
So, how can I solve the problem?
The fundamental operation you want, accumulated over k, is
c[i] += a[i,k]*b[k]
If you use row-major order storage this becomes
c[i] += a[i*l + k]*b[k]
If you use column-major order storage this becomes
c[i] += a[k*m + i]*b[k]
For row-major order you can parallelize like this
    #pragma omp parallel for
    for(int i=0; i<m; i++) {
        for(int k=0; k<l; k++) {
            c[i] += a[i*l+k]*b[k];
        }
    }
For column-major order you can parallelize like this
    #pragma omp parallel
    for(int k=0; k<l; k++) {
        #pragma omp for
        for(int i=0; i<m; i++) {
            c[i] += a[k*m+i]*b[k];
        }
    }
Matrix-vector operations are BLAS Level 2 operations, which are memory-bandwidth bound. Level 1 and Level 2 operations don't scale well with, e.g., the number of cores. Only Level 3 operations (e.g. dense matrix multiplication) scale: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms#Level_3.
The issue is not with your AVX intrinsics. Let's look at the code without the intrinsics for a minute:
    void mv(double *a, double *b, double *c, int m, int n, int l)
    {
        #pragma omp parallel for schedule(static)
        for (int k = 0; k < l; k++) {
            double xb = b[k];
            for (int i = 0; i < m; i++) {
                double xa = a[m*k+i];
                double xc = c[i];
                xc = xc + xa * xb;
                c[i] = xc;
            }
        }
    }
Note: your private declaration was technically correct but redundant, because the variables were already declared inside the parallel region. It is just much easier to reason about the code if you declare variables as locally as possible.
The race condition in your code is on c[i], which multiple threads try to update. Even if you protected it with, say, an atomic update, the performance would be horrible: not only because of the protection, but because the data of c[i] has to be constantly shifted around between the caches of different cores.
One thing you can do about this is to use an array reduction on c. This makes a private copy of c for each thread and they get merged at the end:
    void mv(double *a, double *b, double *c, int m, int n, int l)
    {
        #pragma omp parallel for schedule(static) reduction(+:c[:m])
        for (int k = 0; k < l; k++) {
            for (int i = 0; i < m; i++) {
                c[i] += a[m*k+i] * b[k];
            }
        }
    }
This should be reasonably efficient as long as two m-vectors fit in your cache, but you may still see a lot of thread management overhead. Eventually you will be limited by memory bandwidth, because a matrix-vector multiplication performs only one multiply-add per element read from a.
Anyway, you can of course swap i and k loops and save the reduction, but then your memory access pattern on a will be inefficient (strided) - so you should block the loop to avoid that.
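As a minimal sketch of the swapped, blocked loop described above, assuming the same storage as in the question (a[k*m + i]): the function name mv_blocked and the tile width IBLOCK are hypothetical, and the value 64 is only a starting guess to tune. Threads are distributed over tiles of i, so every c[i] has exactly one writer and no reduction is needed, while a is still read contiguously inside each tile.

```c
#include <assert.h>

/* Hypothetical tile width over i (an assumption; tune it to your cache). */
#define IBLOCK 64

/* c[i] += sum over k of a[k*m + i] * b[k], with a stored as in the question. */
void mv_blocked(const double *a, const double *b, double *c, int m, int l)
{
    assert(m >= 0 && l >= 0);
    /* One thread per tile of i: no two threads ever write the same c[i]. */
    #pragma omp parallel for schedule(static)
    for (int i0 = 0; i0 < m; i0 += IBLOCK) {
        const int i1 = (i0 + IBLOCK < m) ? i0 + IBLOCK : m;
        for (int k = 0; k < l; k++) {
            const double bk = b[k];
            /* For a fixed k, the tile of a is read contiguously. */
            for (int i = i0; i < i1; i++) {
                c[i] += a[k * m + i] * bk;
            }
        }
    }
}
```

The code also compiles and gives the same result without -fopenmp, because the pragma is then simply ignored.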
Now if you look at the output of any modern compiler, it will generate SIMD code on its own. Of course you can apply your own SIMD intrinsics if you want to. But make sure that you handle the edge cases correctly if m is not divisible by 4 (you did not in your original version).
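A minimal sketch of that edge-case handling, assuming the compiler defines __AVX__ and __FMA__ when -mavx and -mfma are passed (Clang and GCC do). The helper name axpy_row is hypothetical. It updates four doubles per step and finishes the remaining m % 4 elements with a scalar loop; without those flags only the scalar loop is compiled. Note the argument order: _mm256_fmadd_pd(x, y, z) computes x*y + z.

```c
#include <assert.h>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* c[0..m) += a_row[0..m) * bk, for any m (not only multiples of 4). */
void axpy_row(const double *a_row, double bk, double *c, int m)
{
    assert(m >= 0);
    int i = 0;
#if defined(__AVX__) && defined(__FMA__)
    const __m256d vb = _mm256_set1_pd(bk);
    /* Vector part: stops before a partial group of 4 would be read. */
    for (; i + 4 <= m; i += 4) {
        __m256d va = _mm256_loadu_pd(&a_row[i]);
        __m256d vc = _mm256_loadu_pd(&c[i]);
        vc = _mm256_fmadd_pd(va, vb, vc);   /* va * vb + vc */
        _mm256_storeu_pd(&c[i], vc);
    }
#endif
    /* Scalar tail: the last m % 4 elements (or everything without AVX/FMA). */
    for (; i < m; i++) {
        c[i] += a_row[i] * bk;
    }
}
```

Calling this once per k with a_row = &a[m*k] and bk = b[k] reproduces the inner loop of mv; it does not by itself remove the race on c.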
At the end of the day, if you really want performance, use the functions from a BLAS library (e.g. MKL). If you want to play around with optimization, there are ample opportunities to go into deep detail.
I'm learning OpenMP and I'm trying to do a simple task: A[r][c] * X[c] = B[r] (matrix vector multiplication).
The problem is: the sequential code is faster than parallel and I don't know why!
My code:
#include <omp.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/types.h>
// Defined variables
#define row_matriz_A 80000
#define col_matriz_A 800
#define THREADS_NUM 4
void gerarMatrizes(int r, int c, int mA[], int vX[], int vB[]){...}
void multSequencial(int r, int c, int mA[], int vX[], int vB[]){
    // Variables
    int i, j, offset, sum;
    struct timeval tv1,tv2;
    double t1, t2;
    // Begin Time
    gettimeofday(&tv1, NULL);
    t1 = (double)(tv1.tv_sec) + (double)(tv1.tv_usec)/ 1000000.00;
    for(i = 0; i < r; i++){
        sum = 0;
        for(j = 0; j < c; j++){
            offset = i * c + j;
            sum += mA[offset] * vX[j];
        }
        vB[i] = sum;
    }
    // End time
    gettimeofday(&tv2, NULL);
    t2 = (double)(tv2.tv_sec) + (double)(tv2.tv_usec)/ 1000000.00;
    printf("\nSequential execution time: %lf seconds.\n", (t2 - t1));
}
void matvecHost(int r, int c, int mA[], int vX[], int vB[]){
    // Variables
    int tID, i, j, offset, sum;
    struct timeval tv1, tv2;
    double t1, t2;
    // Init vB
    for(i = 0; i < r; i++) vB[i] = 0;
    // BEGIN Time
    gettimeofday(&tv1, NULL);
    t1 = (double)(tv1.tv_sec) + (double)(tv1.tv_usec)/ 1000000.00;
    #pragma omp parallel private(tID, i, j) shared(mA, vB, vX)
    {
        tID = omp_get_thread_num();
        #pragma omp for
        for(i = 0; i < r; i++){
            sum = 0;
            for(j = 0; j < c; j++){
                offset = i * c + j;
                sum += mA[offset] * vX[j];
            }
            vB[i] = sum;
        }
    }
    // End time
    gettimeofday(&tv2, NULL);
    t2 = (double)(tv2.tv_sec) + (double)(tv2.tv_usec)/ 1000000.00;
    printf("\nOpenMP execution time: %lf seconds.\n", (t2 - t1));
}
int main(int argc, char * argv[]) {
    int row, col;
    row = row_matriz_A;
    col = col_matriz_A;
    int *matrizA = (int *)calloc(row * col, sizeof(int));
    int *vectorX = (int *)calloc(col * 1, sizeof(int));
    int *vectorB = (int *)calloc(row * 1, sizeof(int));
    gerarMatrizes(row, col, matrizA, vectorX, vectorB);
    multSequencial(row, col, matrizA, vectorX, vectorB);
    matvecHost(row, col, matrizA, vectorX, vectorB);
    return 0;
}
Previous attempts that did not work:
Use collapse on my nested for loops
Increase the row and column sizes
Increase the number of threads (a teacher recommended using as many threads as there are hardware threads)
Use malloc instead of m[i][j]
Based on the correct answer, I changed my parallel block to:
    #pragma omp parallel private(i, j, sum) shared(mA, vB, vX)
    {
        #pragma omp for
        for(i = 0; i < r; i++){
            sum = 0;
            for(j = 0; j < c; j++){
                sum += mA[i * c + j] * vX[j];
            }
            vB[i] = sum;
        }
    }
I still have one doubt:
If I define i, j and sum inside my parallel block, will they be private automatically? Does this improve the speed of my code or not?
You have race conditions on sum and offset - those are shared between the threads instead of being thread-private.
This also likely explains the slowdown: On x86, the CPU will actually work hard to make sure accesses to shared variables "work". This involves flushing cache lines after every (!) write to offset and sum - so all the threads are wildly writing into the same variables, but each one has to wait until the write from the previous thread (on a different core) has arrived in the local cache again after having been flushed. And of course it will produce completely nonsensical results.
I don't know why you are declaring all your variables at the start of the function; that is prone to these kinds of mistakes. If you had declared i, j, sum and offset (and the unused tID) in the smallest possible scopes instead, you would never have had this problem, because they would automatically be thread-private.
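A minimal sketch of that advice, using a hypothetical function name matvec_scoped (not from the question). Every loop index and the accumulator are declared where they are used, inside the parallel loop, so each thread gets its own copy without any private clause. Scoping them this way is first of all about correctness; any speed-up comes from no longer contending on shared variables.

```c
#include <assert.h>

/* vB[i] = sum over j of mA[i*c + j] * vX[j], for a row-major r x c matrix. */
void matvec_scoped(int r, int c, const int *mA, const int *vX, int *vB)
{
    assert(r >= 0 && c >= 0);
    #pragma omp parallel for
    for (int i = 0; i < r; i++) {        /* i is private: it is the loop variable */
        int sum = 0;                     /* private: declared inside the loop body */
        for (int j = 0; j < c; j++) {    /* private for the same reason */
            sum += mA[i * c + j] * vX[j];
        }
        vB[i] = sum;                     /* each i is written by exactly one thread */
    }
}
```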
Recently I have been working on a C OpenMP code that carries out affinity scheduling. Basically, after a thread has finished its assigned iterations, it looks for the thread with the largest remaining workload and steals some work from it.
Everything compiles fine with icc. However, when I run the program, I get a segmentation fault (core dumped). The funny thing is that the error does not always happen: even if I get an error the first time I run the code, it sometimes works when I run it again. This is so weird to me. I wonder what I did wrong in my code and how to fix the problem. Thank you. I only modified the functions runloop and affinity; the others were given at the beginning and work fine.
#include <stdio.h>
#include <math.h>
#define N 729
#define reps 1000
#include <omp.h>
double a[N][N], b[N][N], c[N];
int jmax[N];
void init1(void);
void init2(void);
void runloop(int);
void loop1chunk(int, int);
void loop2chunk(int, int);
void valid1(void);
void valid2(void);
int affinity(int*, int*, int, int, float, int*, int*);
int main(int argc, char *argv[]) {
    double start1,start2,end1,end2;
    int r;
    init1();
    start1 = omp_get_wtime();
    for (r=0; r<reps; r++){
        runloop(1);
    }
    end1 = omp_get_wtime();
    valid1();
    printf("Total time for %d reps of loop 1 = %f\n",reps, (float)(end1-start1));
    init2();
    start2 = omp_get_wtime();
    for (r=0; r<reps; r++){
        runloop(2);
    }
    end2 = omp_get_wtime();
    valid2();
    printf("Total time for %d reps of loop 2 = %f\n",reps, (float)(end2-start2));
}
void init1(void){
    int i,j;
    for (i=0; i<N; i++){
        for (j=0; j<N; j++){
            a[i][j] = 0.0;
            b[i][j] = 3.142*(i+j);
        }
    }
}
void init2(void){
    int i,j, expr;
    for (i=0; i<N; i++){
        expr = i%( 3*(i/30) + 1);
        if ( expr == 0) {
            jmax[i] = N;
        }
        else {
            jmax[i] = 1;
        }
        c[i] = 0.0;
    }
    for (i=0; i<N; i++){
        for (j=0; j<N; j++){
            b[i][j] = (double) (i*j+1) / (double) (N*N);
        }
    }
}
void runloop(int loopid)
{
    int nthreads = omp_get_max_threads(); // we set it before the parallel region; omp_get_num_threads() would always return 1 here
    int ipt = (int) ceil((double)N/(double)nthreads);
    float chunks_fraction = 1.0 / nthreads;
    int threads_lo_bound[nthreads];
    int threads_hi_bound[nthreads];
    #pragma omp parallel default(none) shared(threads_lo_bound, threads_hi_bound, nthreads, loopid, ipt, chunks_fraction)
    {
        int myid = omp_get_thread_num();
        int lo = myid * ipt;
        int hi = (myid+1)*ipt;
        if (hi > N) hi = N;
        threads_lo_bound[myid] = lo;
        threads_hi_bound[myid] = hi;
        int current_lower_bound = 0;
        int current_higher_bound = 0;
        int affinity_steal = 0;
        while(affinity_steal != -1)
        {
            switch(loopid)
            {
                case 1: loop1chunk(current_lower_bound, current_higher_bound); break;
                case 2: loop2chunk(current_lower_bound, current_higher_bound); break;
            }
            #pragma omp critical
            {
                affinity_steal = affinity(threads_lo_bound, threads_hi_bound, nthreads, myid, chunks_fraction, &current_lower_bound, &current_higher_bound);
            }
        }
    }
}
int affinity(int* threads_lo_bound, int* threads_hi_bound, int num_of_thread, int thread_num, float chunks_fraction, int *current_lower_bound, int *current_higher_bound)
{
    int current_pos;
    if (threads_hi_bound[thread_num] - threads_lo_bound[thread_num] > 0)
    {
        current_pos = thread_num;
    }
    else
    {
        int new_pos = -1;
        int jobs_remain = 0;
        int i;
        for (i = 0; i < num_of_thread; i++)
        {
            int diff = threads_hi_bound[i] - threads_lo_bound[i];
            if (diff > jobs_remain)
            {
                new_pos = i;
                jobs_remain = diff;
            }
        }
        current_pos = new_pos;
    }
    if (current_pos == -1) return -1;
    int remaining_iterations = threads_hi_bound[current_pos] - threads_lo_bound[current_pos];
    int iter_size_fractions = (int)ceil(chunks_fraction * remaining_iterations);
    *current_lower_bound = threads_lo_bound[current_pos];
    *current_higher_bound = threads_lo_bound[current_pos] + iter_size_fractions;
    threads_lo_bound[current_pos] = threads_lo_bound[current_pos] + iter_size_fractions;
    return current_pos;
}
void loop1chunk(int lo, int hi) {
    int i,j;
    for (i=lo; i<hi; i++){
        for (j=N-1; j>i; j--){
            a[i][j] += cos(b[i][j]);
        }
    }
}
void loop2chunk(int lo, int hi) {
    int i,j,k;
    double rN2;
    rN2 = 1.0 / (double) (N*N);
    for (i=lo; i<hi; i++){
        for (j=0; j < jmax[i]; j++){
            for (k=0; k<j; k++){
                c[i] += (k+1) * log (b[i][j]) * rN2;
            }
        }
    }
}
void valid1(void) {
    int i,j;
    double suma;
    suma= 0.0;
    for (i=0; i<N; i++){
        for (j=0; j<N; j++){
            suma += a[i][j];
        }
    }
    printf("Loop 1 check: Sum of a is %lf\n", suma);
}
void valid2(void) {
    int i;
    double sumc;
    sumc= 0.0;
    for (i=0; i<N; i++){
        sumc += c[i];
    }
    printf("Loop 2 check: Sum of c is %f\n", sumc);
}
You don't initialise the arrays threads_lo_bound and threads_hi_bound, so they initially contain arbitrary values (this is source of randomness number 1).
You then enter the parallel region, where it is imperative to realise that not all threads move through the code in sync. The actual speed of each thread is quite random, because it shares the CPU with many other programs; even if those only use 1%, that will still show. (This is source of randomness number 2; I'd argue it is the more relevant reason why you see the code working every now and then.)
So what happens when the code crashes?
One of the threads (most likely the master) reaches the critical region before at least one of the other threads has reached the line where you set threads_lo_bound[myid] and threads_hi_bound[myid].
After that, depending on the arbitrary values stored there (you can generally assume they are out of bounds: your arrays are fairly small, so the odds of those values being valid indices are pretty slim), the thread will try to steal some of the jobs (which don't exist) by setting current_lower_bound and/or current_higher_bound to values outside the range of your arrays a, b and c.
It will then enter the second iteration of your while(affinity_steal != -1) loop and access memory that is out of bounds, inevitably leading to a segmentation fault. (Eventually, that is: in principle it is undefined behaviour, so the crash can occur at any point after an invalid memory access, or in some cases never, leading you to believe everything is in order when it most definitely is not.)
The fix of course is simple, add
#pragma omp barrier
just before the while(affinity_steal != -1) loop to ensure all threads have reached that point (i.e. synchronise the threads at that point) and the bounds are properly set before you proceed into the loop. The overhead of this is minimal, but if for some reason you wish to avoid using barriers, you can simply set the values of the array before entering the parallel region.
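As a sketch of the second option (filling the bounds before the parallel region), assuming the same N and the same ceil(N/nthreads) split as runloop. The helper name init_bounds is hypothetical, and clamping lo as well as hi is an addition of this sketch, so that no thread ever starts with a negative range.

```c
#include <assert.h>

#define N 729

/* Fill the per-thread iteration ranges serially, before any thread can read them. */
void init_bounds(int *lo_bound, int *hi_bound, int nthreads)
{
    assert(nthreads > 0);
    const int ipt = (N + nthreads - 1) / nthreads;   /* integer ceil(N / nthreads) */
    for (int t = 0; t < nthreads; t++) {
        int lo = t * ipt;
        int hi = (t + 1) * ipt;
        if (lo > N) lo = N;   /* threads beyond the end get an empty range */
        if (hi > N) hi = N;
        lo_bound[t] = lo;
        hi_bound[t] = hi;
    }
}
```

In runloop this would be called as init_bounds(threads_lo_bound, threads_hi_bound, nthreads) right before the #pragma omp parallel line, and the two assignments inside the region would be dropped.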
That said, bugs like this can usually be located with a good debugger. I strongly suggest learning how to use one; it makes life much easier.
I would like to achieve a very efficient parallel reduction operation (i.e. summation): each column of a 2-dimensional array (a memory buffer in row-major layout) should be summed into an entry of a 1-dimensional array.
To be more clear about the expected input and output
double* array = malloc(sizeof(double) * shape0 * shape1); /* (shape0*shape1) 2-d array */
double* out = malloc(sizeof(double) * shape1); /* where out[j] = sum_i(array_ij) */
Parallelising the sum of rows is pretty straightforward and efficient because the values are contiguous in memory and there is no risk of race conditions. I found this works really well
void sum_rows(double* array, int shape0, int shape1, double* out) {
    int i, j;
    #pragma omp parallel for private(j) schedule(guided)
    for (i=0; i < shape0; i++){
        for (j=0; j < shape1; j++){
            out[i] += array[shape1 * i + j];
        }
    }
}
I am finding it more difficult to parallelise over the other axis.
This should be a straightforward parallel recipe but I was not able to find a definitive answer what is the most efficient way to program this.
This is the naive serial code I would like to write an efficient parallel version of:
void sum_columns(double* array, int shape0, int shape1, double* out) {
    int i, j;
    for (i=0; i < shape0; i++){
        for (j=0; j < shape1; j++){
            out[j] += array[shape1 * i + j];
        }
    }
}
I have already read the following q/a but they didn't lead me to any speedup over the naive sequential code:
Parallelizing matrix times a vector by columns and by rows with OpenMP
OpenMP average of an array
Reduction with OpenMP
Just reporting the fastest implementation I was able to achieve after some attempts. Here I assign contiguous blocks of columns to the different threads, so that each thread works as locally as possible and false sharing is avoided.
void sum_columns(double* array, int N_rows, int N_cols, double* out, int n_threads) {
    #pragma omp parallel
    {
        /* private vars */
        int i, j, id, N_threads, col_chunk_size, start_col, end_col;
        /* ICVs */
        id = omp_get_thread_num();
        N_threads = omp_get_num_threads();
        /* distribute cols to different threads */
        col_chunk_size = N_cols / N_threads;
        start_col = id * col_chunk_size;
        end_col = (id+1) * col_chunk_size;
        if (id == N_threads - 1) end_col = N_cols;
        /* main loop */
        for (i=0; i < N_rows; i++){
            for (j=start_col; j < end_col; j++){
                out[j] += array[N_cols * i + j];
            }
        }
    }
}
I am an OpenMP beginner and have come across the following problem.
I have a mask array M of length N whose elements are each either 0 or 1. I want to extract all indices i that satisfy M[i]=1 and store them in a new array T.
Can this problem be accelerated by OpenMP?
I have tried the following code, but it does not perform well.
    int count = 0;
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        if(M[i] == hashtag) {
            int pos = 0;
            #pragma omp critical (c1)
            pos = count++;
            T[pos] = i;
        }
    }
I am not 100% sure this will be much better, but you could try the following:
    int count = 0;
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        if(M[i]) {
            #pragma omp atomic
            T[count++] = i;
        }
    }
If the array is quite sparse, threads will be able to zip through a lot of zeros without waiting for others. But you can only update one index at a time. The problem is really that different threads are writing to the same memory block (T), which means you will be running into caching issues: every time one thread writes to T, the cache of all the other cores is "dirty", so when they try to modify it, a lot of shuffling goes on behind the scenes. All this is transparent to you (you don't need to write code to handle it), but it slows things down significantly. I suspect that's your real bottleneck. If your array is big enough to make it worth your while, you might try the following:
Create as many arrays T as there are threads
Let each thread update its own version of T
Combine all the T arrays into one after the loops have completed
It might be faster (because the different threads don't write to the same memory) - but since there are so few statements inside the loop, I suspect it won't be.
EDIT: I created a complete test program and found two things. First, the atomic directive doesn't work in all versions of OpenMP, and you may well have to use T[count++] += i; for it to even compile (which is OK, since T can be set to all zeros initially). More troubling, you will not get the same answer twice if you do this (the final value of count changes from one pass to the next); if you use critical, that doesn't happen.
A second observation is that the program really slows down as you increase the number of threads, which confirms what I suspected about shared memory (times for processing 10M elements):
threads  elapsed
1        0.09s
2        0.73s
3        1.21s
4        1.62s
5        2.34s
You can see this is true by changing how sparse the array M is: when I create M as a random array and test for M[i] < 0.01 * RAND_MAX (a 1% dense array), things run much more quickly than if I make it 10% dense, showing that the part inside the critical section is really what slows us down.
That being the case, I don't think there is a way of speeding up this task in OMP - the job of consolidating the outputs of all the threads into a single list at the end is just going to eat up any speed advantage you may have had, given how little is going on inside the inner loop. So rather than using multiple threads, I suggest you rewrite the loop as efficiently as possible - for example:
    for( i = 0; i < N; i++) {
        T[count] = i;
        count += M[i];
    }
In my quick benchmark, this was faster than the OMP solution - comparable with the threads = 1 solution. Again - this is because of the way memory is being accessed here. Note that I avoid using an if statement - this keeps the code as fast as possible. Instead, I take advantage of the fact that M[i] is always zero or one. At the end of the loop you have to discard the element T[count] because it will be invalid... the "good elements" are T[0] ... T[count-1]. An array M with 10M elements was processed by this loop in ~ 0.02 sec on my machine. Should be sufficient for most purposes?
Based on Floris's fast function, I tried to see if I could find a faster solution with OpenMP. I came up with two functions, foo_v2 and foo_v3, which are faster for larger arrays: foo_v2 is faster independent of density, and foo_v3 is faster for sparser arrays. The function foo_v2 essentially treats T as a 2D array with one row of width N/nthreads per thread, and fills an array counta which contains the count for each thread. This is better explained with code. The following code would loop over all the elements written out to T.
    for(int ithread=0; ithread<nthreads; ithread++) {
        for(int i=0; i<counta[ithread]; i++) {
            T[ithread*N/nthreads + i]
        }
    }
The function foo_v3 creates a 1D array as requested. In all cases N has to be pretty large to overcome the OpenMP overhead. The code below defaults to 256MB with a density of M of about 10%. The OpenMP functions are both faster by over a factor of 2 on my 4 core Sandy Bridge system. If you put the density at 50%, foo_v2 is still faster by about a factor of 2, but foo_v3 is no longer faster.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int foo_v1(int *M, int *T, const int N) {
    int count = 0;
    for(int i = 0; i<N; i++) {
        T[count] = i;
        count += M[i];
    }
    return count;
}
int foo_v2(int *M, int *T, int *&counta, const int N) {
    int nthreads;
    #pragma omp parallel
    {
        nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();
        #pragma omp single
        counta = new int[nthreads];
        int count_private = 0;
        #pragma omp for
        for(int i = 0; i<N; i++) {
            T[ithread*N/nthreads + count_private] = i;
            count_private += M[i];
        }
        counta[ithread] = count_private;
    }
    return nthreads;
}
int foo_v3(int *M, int *T, const int N) {
    int count = 0;
    int *counta = 0;
    #pragma omp parallel reduction(+:count)
    {
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();
        #pragma omp single
        {
            counta = new int[nthreads+1];
            counta[0] = 0;
        }
        int *Tprivate = new int[N/nthreads];
        int count_private = 0;
        #pragma omp for nowait
        for(int i = 0; i<N; i++) {
            Tprivate[count_private] = i;
            count_private += M[i];
        }
        counta[ithread+1] = count_private;
        count += count_private;
        #pragma omp barrier
        int offset = 0;
        for(int i=0; i<(ithread+1); i++) {
            offset += counta[i];
        }
        for(int i=0; i<count_private; i++) {
            T[offset + i] = Tprivate[i];
        }
        delete[] Tprivate;
    }
    delete[] counta;
    return count;
}
void compare(const int *T1, const int *T2, const int N, const int count, const int *counta, const int nthreads) {
    int diff = 0;
    int n = 0;
    for(int ithread=0; ithread<nthreads; ithread++) {
        for(int i=0; i<counta[ithread]; i++) {
            int i2 = N*ithread/nthreads+i;
            //printf("%d %d\n", T1[n], T2[i2]);
            int tmp = T1[n++] - T2[i2];
            if(tmp<0) tmp*=-1;
            diff += tmp;
        }
    }
    printf("diff %d\n", diff);
}
void compare_v2(const int *T1, const int *T2, const int count) {
    int diff = 0;
    int n = 0;
    for(int i=0; i<count; i++) {
        int tmp = T1[i] - T2[i];
        //if(tmp!=0) printf("%i %d %d\n", i, T1[i], T2[i]);
        if(tmp<0) tmp*=-1;
        diff += tmp;
    }
    printf("diff %d\n", diff);
}
int main() {
    const int N = 1 << 26;
    printf("%f MB\n", 4.0*N/1024/1024);
    int *M = new int[N];
    int *T1 = new int[N];
    int *T2 = new int[N];
    int *T3 = new int[N];
    int *counta;
    double dtime;
    for(int i=0; i<N; i++) {
        M[i] = ((rand()%10)==0);
    }
    //int repeat = 10000;
    int repeat = 1;
    int count1, count2;
    int nthreads;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) count1 = foo_v1(M, T1, N);
    dtime = omp_get_wtime() - dtime;
    printf("time v1 %f\n", dtime);
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) nthreads = foo_v2(M, T2, counta, N);
    dtime = omp_get_wtime() - dtime;
    printf("time v2 %f\n", dtime);
    compare(T1, T2, N, count1, counta, nthreads);
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) count2 = foo_v3(M, T3, N);
    dtime = omp_get_wtime() - dtime;
    printf("time v3 %f\n", dtime);
    printf("count1 %d, count2 %d\n", count1, count2);
    compare_v2(T1, T3, count1);
}
The critical operation should be atomic instead of critical; actually in your case you have to use the atomic capture clause:
    int pos, count = 0; // pos declared outside the loop
    #pragma omp parallel for private(pos) // and privatized, count is implicitly
    for(int i = 0; i < N; ++i) { // shared by all the threads
        if(M[i]) {
            #pragma omp atomic capture
            pos = count++;
            T[pos] = i;
        }
    }
Take a look at this answer for an overview of all the possible atomic operations in OpenMP.
I want to do a reduction on an array using OpenMP and SIMD. I read that a reduction in OpenMP is equivalent to:
inline float sum_scalar_openmp2(const float a[], const size_t N) {
    float sum = 0.0f;
    #pragma omp parallel
    {
        float sum_private = 0.0f;
        #pragma omp for nowait
        for(int i=0; i<N; i++) {
            sum_private += a[i];
        }
        #pragma omp atomic
        sum += sum_private;
    }
    return sum;
}
I got this idea from the following link:
But atomic also does not support complex operators. What I did was replace atomic with critical and implemented the reduction with OpenMP and SSE like this:
#define ROUND_DOWN(x, s) ((x) & ~((s)-1))
inline float sum_vector4_openmp(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    #pragma omp parallel
    {
        __m128 sum4_private = _mm_set1_ps(0.0f);
        #pragma omp for nowait
        for(int i=0; i < ROUND_DOWN(N, 4); i+=4) {
            __m128 a4 = _mm_load_ps(a + i);
            sum4_private = _mm_add_ps(a4, sum4_private);
        }
        #pragma omp critical
        sum4 = _mm_add_ps(sum4_private, sum4);
    }
    __m128 t1 = _mm_hadd_ps(sum4,sum4);
    __m128 t2 = _mm_hadd_ps(t1,t1);
    float sum = _mm_cvtss_f32(t2);
    for(int i = ROUND_DOWN(N, 4); i < N; i++) {
        sum += a[i];
    }
    return sum;
}
However, this function does not perform as well as I hope. I'm using Visual Studio 2012 Express. I know I can improve the performance a bit by unrolling the SSE load/add a few times but that still is less than I expect.
I get much better performance by running over slices of the arrays equal to the number of threads:
inline float sum_slice(const float a[], const size_t N) {
    int nthreads = 4;
    const int offset = ROUND_DOWN(N/nthreads, nthreads);
    float suma[8] = {0};
    #pragma omp parallel for num_threads(nthreads)
    for(int i=0; i<nthreads; i++) {
        suma[i] = sum_vector4(&a[i*offset], offset);
    }
    float sum = 0.0f;
    for(int i=0; i<nthreads; i++) {
        sum += suma[i];
    }
    for(int i=nthreads*offset; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
inline float sum_vector4(const float a[], const size_t N) {
    __m128 sum4 = _mm_set1_ps(0.0f);
    int i = 0;
    for(; i < ROUND_DOWN(N, 4); i+=4) {
        __m128 a4 = _mm_load_ps(a + i);
        sum4 = _mm_add_ps(sum4, a4);
    }
    __m128 t1 = _mm_hadd_ps(sum4,sum4);
    __m128 t2 = _mm_hadd_ps(t1,t1);
    float sum = _mm_cvtss_f32(t2);
    for(; i < N; i++) {
        sum += a[i];
    }
    return sum;
}
Does someone know if there is a better way of doing reductions with more complicated operators in OpenMP?
I guess the answer to your question is No. I don't think there is a better way of doing reduction with more complicated operators in OpenMP.
Assuming that the array is 16-byte aligned and the number of OpenMP threads is 4, one might expect a performance gain of 12x - 16x from OpenMP + SIMD. In practice, it might not produce that much gain because
There is an overhead in creating the OpenMP threads.
The code is doing 1 addition operation for 1 Load operation. Hence, the CPU isn't doing enough computation. So, it almost looks like the CPU spends most of the time in loading the data, kind of memory bandwidth bound.
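For the plain float sum that the question starts from, the built-in reduction clause already expresses the private-accumulator pattern of sum_scalar_openmp2. The sketch below uses a hypothetical function name, sum_reduction, and takes the length as an int because older OpenMP versions require a signed loop index. It does not address the SSE case, and it is bound by memory bandwidth in the same way.

```c
#include <assert.h>

/* Sum of a[0..n) using OpenMP's built-in + reduction. */
float sum_reduction(const float a[], int n)
{
    assert(n >= 0);
    float sum = 0.0f;
    /* Each thread accumulates into a private copy of sum;
       OpenMP adds the copies into the shared sum at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

Because the threads add the elements in a different order than a serial loop, the float result can differ from the serial one in the last bits.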