OpenMP Parallelize code inside a for loop - c

I want to parallelize tasks inside a for loop using OpenMP. However, I do not want to use #pragma omp parallel for, because the result of the (i+1)-th iteration depends on the output of the i-th iteration. I have tried spawning the threads inside the loop body, but the overhead of creating and destroying them in every iteration is very high. An abstract description of my code is:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
for (int i=0; i<1000; i++)
{
a_new = fun(a_old); //fun() depends only on the value of the argument
a_old = a_new;
b_new = fun(b_old);
b_old = b_new;
c_new = fun(c_old);
c_old = c_new;
d_new = fun(d_old);
d_old = d_new;
}
How can I efficiently use threads to calculate the new values of a_new, b_new, c_new and d_new in parallel in each iteration?

Just don't parallelize the code inside the for loop - move the parallel region to the outside. This reduces the thread creation and worksharing overhead. Then you can easily apply OpenMP sections:
int a_old = 1;
int b_old = 1;
int c_old = 1;
int d_old = 1;

#pragma omp parallel sections
{
    #pragma omp section
    for (int i = 0; i < 1000; i++) {
        a_new = fun(a_old); // fun() depends only on the value of the argument
        a_old = a_new;
    }

    #pragma omp section
    for (int i = 0; i < 1000; i++) {
        b_new = fun(b_old);
        b_old = b_new;
    }

    #pragma omp section
    for (int i = 0; i < 1000; i++) {
        c_new = fun(c_old);
        c_old = c_new;
    }

    #pragma omp section
    for (int i = 0; i < 1000; i++) {
        d_new = fun(d_old);
        d_old = d_new;
    }
}
There is also another simplification:
int value[4] = { 1, 1, 1, 1 };

#pragma omp parallel for
for (int abcd = 0; abcd < 4; abcd++) {
    for (int i = 0; i < 1000; i++) {
        value[abcd] = fun(value[abcd]);
    }
}
In either case, you might want to consider adding padding between the values to avoid false sharing if fun executes rather quickly.
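For instance, a minimal sketch of such padding, assuming 64-byte cache lines (adjust the size for your hardware):

typedef struct {
    int value;
    char pad[64 - sizeof(int)];   /* assumes a 64-byte cache line */
} padded_int;

padded_int value[4] = { {1}, {1}, {1}, {1} };

#pragma omp parallel for
for (int abcd = 0; abcd < 4; abcd++) {
    for (int i = 0; i < 1000; i++) {
        value[abcd].value = fun(value[abcd].value);
    }
}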

This is pretty straightforward: as #kbr mentioned in the comments, each of the calculations a, b, c and d is independent, so you can separate them into different threads and pass the corresponding value as a parameter. The sample code looks like this.
#include <stdio.h>
#include <pthread.h>

void *thread_func(void *arg)
{
    int *value = (int *)arg;
    for (int j = 0; j < 1000; j++)
    {
        // Instead of the increment you can call whichever function you want here.
        (*value)++;
    }
    return NULL;
}

int main()
{
    int a_old = 1;
    int b_old = 1;
    int c_old = 1;
    int d_old = 1;
    pthread_t thread[4];
    pthread_create(&thread[0], NULL, thread_func, &a_old);
    pthread_create(&thread[1], NULL, thread_func, &b_old);
    pthread_create(&thread[2], NULL, thread_func, &c_old);
    pthread_create(&thread[3], NULL, thread_func, &d_old);
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    pthread_join(thread[2], NULL);
    pthread_join(thread[3], NULL);
    printf("a_old %d\n", a_old);
    printf("b_old %d\n", b_old);
    printf("c_old %d\n", c_old);
    printf("d_old %d\n", d_old);
    return 0;
}
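Note that the example has to be compiled and linked with pthread support, e.g.:
gcc -pthread example.c -o example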

Related

Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

I've implemented a version of the Travelling Salesman problem with xmmintrin.h SSE instructions and received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slowdown. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for (int i = 0; i < numNodes; i++) {
    minimumDistance = DBL_MAX;
    minimumDistanceNode;
    for (int j = 0; j < numNodes; j++) {
        // find distance between 'currentNode' to j-th node
        // ...
        if (jthNodeDistance < minimumDistance) {
            minimumDistance = jthNodeDistance;
            minimumDistanceNode = jthNode;
        }
    }
    currentNode = minimumDistanceNode;
}
And here is my implementation. It is still semi-pseudocode, as I've brushed over some parts that I don't think have an impact on performance; I think the issues with my code can be found in the following snippet. If you just omit the #pragma lines, the following is pretty much identical to the SSE-only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < totalNum; i++) {
            minimum = DBL_MAX;
            __m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
            __m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                float localMinimum = DBL_MAX;
                float localMinimumNode;
                #pragma omp for
                for (int j = 0; j < loopEnd; j += 4) {
                    // a number of SSE vector calculations to find the distance
                    // between the current node and the four nodes we're looking
                    // at in this iteration of the loop:
                    __m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
                    __m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
                    __m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
                    __m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
                    __m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
                    float temp[unroll];
                    _mm_store_ps(&temp[0], addXY_0);
                    // skipping stuff here that is about getting the minimum distance and
                    // its equivalent node; I don't think it's massively relevant, but
                    // each thread will have its own
                    // localMinimum
                    // localMinimumNode
                }
                // updating the global minimumNode in a thread-safe way
                #pragma omp critical (update_minimum)
                {
                    if (localMinimum < minimum) {
                        minimum = localMinimum;
                        minimumNode = localMinimumNode;
                    }
                }
            }
            // within the 'omp single'
            ThisPt = minimumNode;
        }
    }
}
So my logic is:
omp single for the top-level for(int i) for loop, and I only want 1 thread dedicated to this
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) for-loop, as I want all cores working on this part of the code at the same time.
omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
First:
omp single for the top-level for(int i) for loop, and I only want 1 thread dedicated to this
In your code you have the following:
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < totalNum; i++)
        {
            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                //....
            }
            // within the 'omp single'
            ThisPt = minimumNode;
        }
    }
}
The #pragma omp parallel creates a team of threads, but then only one thread executes the parallel task (i.e., the #pragma omp single region) while the other threads don't do anything. You can simplify this to:
for (int i = 1; i < totalNum; i++)
{
    #pragma omp parallel num_threads(omp_get_max_threads())
    {
        //....
    }
    ThisPt = minimumNode;
}
The outer loop and the ThisPt update are still executed by only one thread, just as before; the parallel work happens in the inner parallel region.
Second:
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) for-loop, as I want all cores working on this part of the code at the same time.
The problem is that this might return the number of logical cores rather than physical cores, and some codes perform worse with hyper-threading. So, I would first test with different numbers of threads, starting from 2, then 4, and so on, until you find the number at which the code stops scaling.
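For example, a minimal self-contained sketch of such a scaling test (the summation loop is just a stand-in workload, not the real distance kernel):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int max_threads = omp_get_num_procs();   /* usually the number of logical cores */
    for (int t = 1; t <= max_threads; t *= 2) {
        double start = omp_get_wtime();
        double sum = 0.0;
        /* stand-in workload: replace with the real inner loop you want to time */
        #pragma omp parallel for num_threads(t) reduction(+:sum)
        for (int j = 0; j < 100000000; j++)
            sum += j * 0.5;
        printf("%d threads: %f s (checksum %g)\n", t, omp_get_wtime() - start, sum);
    }
    return 0;
}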
omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
    if (localMinimum < minimum) {
        minimum = localMinimum;
        minimumNode = localMinimumNode;
    }
}
This can be replaced by creating an array where each thread saves its local minimum in a position reserved for that thread; outside the parallel region the initial thread then extracts the minimum and minimumNode:
int total_threads = /* ... */;
float localMinimum[total_threads];
float localMinimumNode[total_threads];
for (int i = 0; i < total_threads; i++)
    localMinimum[i] = FLT_MAX;

#pragma omp parallel num_threads(total_threads)
{
    /* ... */
}

for (int i = 0; i < total_threads; i++) {
    if (localMinimum[i] < minimum) {
        minimum = localMinimum[i];
        minimumNode = localMinimumNode[i];
    }
}
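Alternatively, assuming OpenMP 4.0 or later (and <float.h> for DBL_MAX), a user-declared reduction can carry both the distance and the node index, so neither the critical section nor the per-thread arrays are needed. This is only a sketch, shown with scalar arithmetic rather than the SSE intrinsics; min_pair and best are illustrative names, not from the original code:

typedef struct { double dist; int node; } min_pair;

#pragma omp declare reduction(minpair : min_pair : omp_out = (omp_in.dist < omp_out.dist ? omp_in : omp_out)) initializer(omp_priv = (min_pair){ DBL_MAX, -1 })

min_pair best = { DBL_MAX, -1 };
#pragma omp parallel for reduction(minpair : best)
for (int j = 0; j < numNodes; j++) {
    if (j == currentNode) continue;         /* plus whatever visited-node check the real code uses */
    double dx = xCoordinates[currentNode] - xCoordinates[j];
    double dy = yCoordinates[currentNode] - yCoordinates[j];
    double d  = dx * dx + dy * dy;          /* squared distance, like the SSE version */
    if (d < best.dist) { best.dist = d; best.node = j; }
}
/* after the loop, best.node holds the nearest node */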
Finally, after those changes are done, check whether it is possible to replace this parallelization with the following:
#pragma omp parallel for
for (int i = 1; i < totalNum; i++)
{
...
}

How to synchronize 3 nested loop in OpenMP?

I am writing a program that will match up one block (a group of 4 double numbers which are within a certain absolute value) with another.
Essentially, I will call the function in main.
The matrix has 4399 rows and 500 columns. I am trying to use OpenMP to speed up the task, yet my code seems to have a race condition within the innermost loop (where the actual creation of a block happens: create_Block(rrr[k], i);).
It is ok to ignore all the function details, as they work well in the serial version. The only focus here is the OpenMP directives.
int main(void) {
    readKey("keys.txt");
    double** jz = readMatrix("data.txt");
    int j = 0;
    int i = 0;
    int k = 0;
    #pragma omp parallel for firstprivate(i) shared(Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
    for (i = 0; i < 50; i++) {
        printf("THIS IS COLUMN %d\n", i);
        double* c = readCol(jz, i, 4400);
        #pragma omp parallel for firstprivate(j) shared(i,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
        for (j = 0; j < 4400; j++) {
            // printf("This is fixed row %d from column %d !!!!!!!!!!\n", j, i);
            int* one_collection = collection(c, j, 4400);
            // MODIFY THE DYNAMIC ALLOCATION OF SPACES (SIZE_OF_COMBINATION) IN combNonRec() function.
            if (get_combination_size(SIZE_OF_COLLECTION, M) >= 4) {
                // GET THE 2D-ARRAY OF COMBINATIONS
                int** rrr = combNonRec(one_collection, SIZE_OF_COLLECTION, M);
                #pragma omp parallel for firstprivate(k) shared(i,j,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
                for (k = 0; k < get_combination_size(SIZE_OF_COLLECTION, M); k++) {
                    create_Block(rrr[k], i); // ACTUAL CREATION OF BLOCK !!!!!!!
                    printf("This is block %d \n", NUM_OF_BLOCK);
                    add_To_Block_Collection();
                }
                free(rrr);
            }
            free(one_collection);
        }
        // end of OpenMP for j
        free(c);
    }
    // end of OpenMP for i
    collision();
}
Here is the parallel version's result: non-deterministic, whereas the serial version consistently produces 400 blocks.
Big_Block, NUM_OF_BLOCK and SIZE_OF_COLLECTION are global variables.
Did I do anything wrong in the directive declarations? What might have caused this problem?

Multithreaded program outputs different results every time it runs

I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000

//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
    omp_set_num_threads(10); //set number of threads to be used in the parallel loop
    unsigned int NUMS[1000] = { 0 };
    int j = 0;
    #pragma omp parallel
    {
        int ID = omp_get_thread_num(); //get thread ID
        int i;
        for (i = ID + 1; i < MAX; i += NUM_THREADS)
        {
            if (i % 5 == 0 || i % 3 == 0)
            {
                NUMS[j++] = i; //Store multiples of 3 and 5 in an array to sum up later
            }
        }
    }
    int i = 0;
    unsigned int total = 0;
    for (i = 0; NUMS[i] != 0; i++) total += NUMS[i]; //add up multiples of 3 and 5
    printf("Total : %d\n", total);
    return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use private (per-thread) variables to get around that.
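For instance, a minimal sketch that avoids the shared counter entirely by summing with OpenMP's reduction clause (it drops the NUMS array, which the original code only fills in order to sum it afterwards):

#include <stdio.h>
#include <omp.h>
#define MAX 1000

int main(void)
{
    unsigned int total = 0;
    // each thread accumulates into its own private copy of total,
    // and OpenMP combines the copies when the loop ends
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i < MAX; i++) {
        if (i % 3 == 0 || i % 5 == 0)
            total += i;
    }
    printf("Total : %u\n", total);
    return 0;
}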
In your code j is a shared induction variable. You can't use shared induction variables efficiently with multiple threads (using atomic on every iteration is not efficient).
You could find a special solution that does not use induction variables (for example wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables, like this:
#pragma omp parallel
{
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for (i = 0; i < MAX; i++) {
        if (i % 5 == 0 || i % 3 == 0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    #pragma omp for schedule(static) ordered
    for (i = 0; i < omp_get_num_threads(); i++) {
        #pragma omp ordered
        {
            memcpy(&NUMS[j], NUMS_local, sizeof *NUMS * k);
            j += k;
        }
    }
}
This solution does not make optimal use of memory however. A better solution would use something like std::vector from C++ which you could implement for example using realloc in C but I'm not going to do that for you.
Edit:
Here is a special solution which does not use shared induction variables, using wheel factorization:
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for (int i = 0; i < n; i++) {
    for (int k = 0; k < 7; k++) {
        NUMS[7*i + k] = 15*i + wheel[k];
        total += NUMS[7*i + k];
    }
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for (int i = n*15; i < MAX; i++) {
    if (i % 5 == 0 || i % 3 == 0) {
        NUMS[j++] = i;
        total += i;
    }
}
Edit: It's possible to do this without the critical section implied by the ordered clause. This version does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total = 0, j;
#pragma omp parallel
{
    int i;
    int nthreads = omp_get_num_threads();
    int ithread  = omp_get_thread_num();
    #pragma omp single
    {
        prefix = malloc(sizeof *prefix * (nthreads+1));
        prefix[0] = 0;
    }
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for (i = 0; i < MAX; i++) {
        if (i % 5 == 0 || i % 3 == 0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    prefix[ithread+1] = k;
    #pragma omp barrier
    #pragma omp single
    {
        for (i = 1; i < nthreads+1; i++) prefix[i] += prefix[i-1];
        NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
        j = prefix[nthreads];
    }
    memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS * k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation atomic (incrementing the variable j in your case). It could be a mutex, a semaphore or an event object, depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow logic should look like the following pseudo-code:
{
    lock(kernel_object)
    // ...
    // do your critical operation (increment your variable j in your case)
    // ++j;
    // ...
    unlock(kernel_object)
}
If you're working on the Windows operating system, there are some special synchronization mechanisms provided by the environment (e.g. InterlockedIncrement or CreateCriticalSection). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. All of those synchronization mechanisms stem from the concept of semaphores, invented by Edsger W. Dijkstra in the early 1960s.
Here's some basic examples below:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that the threads don't necessarily execute in order, so the last thread to write may not have read the value in order, and you end up overwriting the wrong data.
OpenMP has an option to make the threads in a loop perform a summation when they finish: the reduction clause. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < num; k++)
{
    sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin, NULL);
All you have to do is read the result from "sum"; this is from some old code of mine that does a summation.
The other option you have is the dirty one: somehow make the threads wait and take turns using a call to the OS. This is easier than it looks. This would be a solution:
#pragma omp parallel
for (i = ID + 1; i < MAX; i += NUM_THREADS)
{
    printf("asdasdasdasdasdasdasdas");
    if (i % 5 == 0 || i % 3 == 0)
    {
        NUMS[j++] = i; //Store multiples of 3 and 5 in an array to sum up later
    }
}
but I recommend you read the OpenMP options fully.

Longest Common Subsequence with openMP

I'm writing a parallel version of the Longest Common Subsequence algorithm using openMP.
The sequential version is the following (and it works correctly):
// Preparing first row and first column with zeros
for (j = 0; j < (len2+1); j++)
    score[0][j] = 0;
for (i = 0; i < (len1+1); i++)
    score[i][0] = 0;

// Calculating scores
for (i = 1; i < (len1+1); i++) {
    for (j = 1; j < (len2+1); j++) {
        if (seq1[i-1] == seq2[j-1]) {
            score[i][j] = score[i-1][j-1] + 1;
        }
        else {
            score[i][j] = max(score[i-1][j], score[i][j-1]);
        }
    }
}
The critical part is filling up the score matrix, and this is the part I'm mostly trying to parallelize.
One way to do it (which I chose) is filling up the matrix by anti-diagonals, so the left, top and top-left dependencies are always satisfied. In a nutshell, I keep track of the diagonal (third loop, variable i below) and threads fill up that diagonal in parallel.
For this purpose, I've written this code:
void parallelCalculateLCS(int len1, int len2, char *seq1, char *seq2) {
    int score[len1 + 1][len2 + 1];
    int i, j, k, iam;
    char *lcs = NULL;
    for (i = 0; i < len1+1; i++)
        for (j = 0; j < len2+1; j++)
            score[i][j] = -1;
    #pragma omp parallel default(shared) private(iam)
    {
        iam = omp_get_thread_num();
        // Preparing first row and first column with zeros
        #pragma omp for
        for (j = 0; j < (len2+1); j++)
            score[0][j] = iam;
        #pragma omp for
        for (i = 0; i < (len1+1); i++)
            score[i][0] = iam;
        // Calculating scores
        for (i = 1; i < (len1+1); i++) {
            k = i;
            #pragma omp for
            for (j = 1; j <= i; j++) {
                if (seq1[k-1] == seq2[j-1]) {
                    // score[k][j] = score[k-1][j-1] + 1;
                    score[k][j] = iam;
                }
                else {
                    // score[k][j] = max(score[k-1][j], score[k][j-1]);
                    score[k][j] = iam;
                }
                #pragma omp atomic
                k--;
            }
        }
    }
}
The first two loops (first row and column) work correctly and threads fill up cells in a balanced way.
When it comes to fill up the matrix (diagonally), nothing works well. I tried to debug it, but it seems that threads act and write things randomly.
I can't figure out what's going wrong, since in the first two loops there were no problems at all.
Any idea?
P.S. I know that accessing the matrix in a diagonal way is very cache-unfriendly and the threads could be unbalanced, but I only need it to work for now.
P.S. #2 I don't know if it could be useful, but my CPU has up to 8 threads.
#pragma omp atomic means that the threads will perform the operation one at a time. You are looking for #pragma omp for private(k): the threads will no longer share the same variable. Bye, Francis
The following nested for loop
#pragma omp for
for(j=1; j <= i; j++)
will be executed in parallel, each thread with a different value of j in no specific order.
As nothing is specified in the omp for section, k will be shared by default between all threads. So depending on the order of the threads, k will be decremented at an unknown time (even with the omp atomic). So for a fixed j, the value of k might change during the execution of the body of the for loop (between the if clauses, ...).
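Following that diagnosis, here is a sketch of the same anti-diagonal traversal with the row index derived from j instead of decrementing a shared k. It replaces only the "Calculating scores" part inside the existing parallel region, keeps the original restriction of walking only the first len1 anti-diagonals, and assumes the max() helper from the serial version; an extra bound check against len2 is still needed for non-square inputs:

// Calculating scores along anti-diagonals
for (int d = 1; d < (len1+1); d++) {        // d is private to each thread, so all threads agree on the current diagonal
    #pragma omp for
    for (j = 1; j <= d; j++) {
        int row = d - j + 1;                // walks up the anti-diagonal as j increases
        if (seq1[row-1] == seq2[j-1]) {
            score[row][j] = score[row-1][j-1] + 1;
        }
        else {
            score[row][j] = max(score[row-1][j], score[row][j-1]);
        }
    }                                       // implicit barrier: the diagonal is finished before moving on
}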

OpenMP: synchronization inside parallel for

I have a code that reads like this
void h(particles *p) {
    #pragma omp parallel for
    for (int i = 0; i < maxThreads; ++i) {
        int id = omp_get_thread_num();
        for (int j = 0; j < dtnum; ++j) {
            f(p, id);
            if (j % 50 == 0) {
                if (id == 0) {
                    g(p);
                }
                #pragma omp barrier
            }
        }
    }
}

void f(particles *p, int id) {
    for (int i = id * prt_thread; i < (id + 1) * prt_thread; ++i) {
        x(p[i]);
    }
}
Basically I want to:
1) spawn a given amount of threads. Each thread will process a chunk of p according to the thread's id
2)each element of p must be processed dtnum times. The processing involve random events
3)every 50 iterations, one thread must perform another operation, while the other threads wait
Problem: gcc says warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
what can I do?
It's hard to tell from the very schematic code, but if all you want to do is sync up every so many iterations, it seems easiest to pull the iteration loop out of the parallel omp for loop - which seems clearer anyway - and just do
const int iterblocks = 50;
#pragma omp parallel shared(p, dtnum) default(none)
for (int jblock = 0; jblock < dtnum/iterblocks; jblock++) {
    for (int j = 0; j < iterblocks; j++) {
        #pragma omp for nowait
        for (int i = 0; i < prt; i++)
            x(p[i]);
    }
    #pragma omp barrier
    #pragma omp single
    g(p);
    #pragma omp barrier
}
I think your code is wrong. You said:
each element of p must be processed dtnum times.
But each element of p will be processed maxThreads*dtnum times.
Could you be more explicit on what your code's supposed to do ?
