OpenMP: synchronization inside parallel for - c

I have a code that reads like this
void h(particles *p) {
#pragma omp parallel for
for (int i = 0; i < maxThreads; ++i) {
int id = omp_get_thread_num();
for (int j = 0; j < dtnum; ++j) {
f( p, id);
if ( j % 50 == 0 ) {
if (id == 0) {
g(p);
}
#pragma omp barrier
}
}
}
}
void f(particles *p, int id) {
for (int i = id * prt_thread; i < (id + 1)*prt_thread; ++i) {
x(p[i]);
}
}
Basically I want to:
1)spawn a given amount of threads. each thread will process a chuck of p according to thread's id
2)each element of p must be processed dtnum times. The processing involve random events
3)every 50 iterations, one thread must perform another operation, while the other threads wait
Problem: gcc says warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
what can I do?

It's hard to tell from the very schematic code, but if all you want to do is sync up every so many iterations, it seems easiest to pull the iteration loop out of the parallel omp for loop - which seems clearer anyway - and just do
const int iterblocks=50;
#pragma omp parallel shared(p, dtnum) default(none)
for (int jblock=0; jblock<dtnum/iterblocks; jblock++) {
for (int j=0; j<iterblocks; j++) {
#pragma omp for nowait
for (int i=0; i<prt; i++)
x(p[i]);
}
#pragma omp barrier
#pragma omp single
g(p);
#pragma omp barrier
}

I think your code is wrong. You said :
each element of p must be processed dtnum times.
But each element of p will be execute maxThreads*dtnum times.
Could you be more explicit on what your code's supposed to do ?

Related

Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

I've implemented a version of the Travelling Salesman with xmmintrin.h SSE instructions, received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slow down. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for(int i = 0; i < numNodes; i++) {
minimumDistance = DBL_MAX;
minimumDistanceNode;
for(int j = 0; j < numNodes; j++) {
// find distance between 'currentNode' to j-th node
// ...
if(jthNodeDistance < minimumDistance) {
minimumDistance = jthNodeDistance;
minimumDistanceNode = jthNode;
}
}
currentNode = minimumDistanceNode;
}
And here is my implementation, that is still semi-pseudocode as I've still brushed over some parts that I don't think have an impact on performance, I think the issues to be found with my code can be found in the following code snippet. If you just omit the #pragma lines, then the following is pretty much identical to the SSE only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++) {
miniumum = DBL_MAX;
__m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
__m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
#pragma omp parallel num_threads(omp_get_max_threads())
{
float localMinimum = DBL_MAX;
float localMinimumNode;
#pragma omp for
for (int j = 0; j < loopEnd; j += 4) {
// a number of SSE vector calculations to find distance
// between the current node and the four nodes we're looking
// at in this iteration of the loop:
__m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
__m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
__m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
__m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
__m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
float temp[unroll];
_mm_store_ps(&temp[0], addXY_0);
// skipping stuff here that is about getting the minimum distance and
// it's equivalent node, don't think it's massively relevant but
// each thread will have its own
// localMinimum
// localMinimumNode
}
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
So my logic is:
omp single for the top-level for(int i) for loop, and I only want 1 thread dedicated to this
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) for-loop, as I want all cores working on this part of the code at the same time.
omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
Does anything jump out at you as particularly bad in my code, that is
causing this drastic slow-down for OpenMP?
First:
omp single for the top-level for(int i) for loop, and I only want 1
thread dedicated to this
In your code you have the following:
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
The #pragma omp parallel creates a team of threads, but then only one thread executes a parallel task (i.e., #pragma omp single) while the other threads don't do anything. You can simplified to:
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
ThisPt = minimumNode;
}
The inner only is still executed by only one thread.
Second :
omp parallel num_threads(omp_get_max_threads()) for the inner for(int
j) for-loop, as I want all cores working on this part of the code at
the same time.
The problem is that this might return the number of logic-cores and not physical cores, and some codes might perform worse with hyper-threading. So, I would first test with a different number of threads, starting from 2, 4 and so on, until you find a number to which the code stops scaling.
omp critical at the end of the full for(int j) loop, as I want to
thread-safely update the current node.
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
this can be replaced by creating an array where each thread save its local minimum in a position reserved to that thread, and outside the parallel region the initial thread extract the minimum and minimumNode:
int total_threads = /..;
float localMinimum[total_threads] = {DBL_MAX};
float localMinimumNode[total_threads] = {DBL_MAX};
#pragma omp parallel num_threads(total_threads)
{
/...
}
for(int i = 0; i < total_threads; i++){
if (localMinimum[i] < minimum) {
minimum = localMinimum[i];
minimumNode = localMinimumNode[i];
}
}
Finally, after those changes are done, you try to check if it is possible to replace this parallelization by the following:
#pragma omp parallel for
for (int i = 1; i < totalNum; i++)
{
...
}

OpenMP Parallelize code inside a for loop

I want to parallelize tasks inside a for loop using OpenMP. However, I do not want to use #pragma omp parallel for as the result of the (i+1)th iteration depends on the output of the (i)th iteration. I have tried to spawn the threads inside the code, but the time of creating and destroying them every time is very high. An abstract description of my code is:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
for (int i=0; i<1000; i++)
{
a_new = fun(a_old); //fun() depends only on the value of the argument
a_old = a_new;
b_new = fun(b_old);
b_old = b_new;
c_new = fun(c_old);
c_old = c_new;
d_new = fun(d_old);
d_old = d_new;
}
How can I efficiently use threads to calculate the new values of a_new, b_new, c_new, d_new in parallel in each iteration ?
Just don't parallelize the code inside the for loop - move the parallel region to the outside. This reduces the thread creation and worksharing overhead. Then you can easily apply OpenMP sections:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
#pragma omp parallel sections
{
#pragma omp section
for (int i=0; i<1000; i++) {
a_new = fun(a_old); //fun() depends only on the value of the argument
a_old = a_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
b_new = fun(b_old);
b_old = b_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
c_new = fun(c_old);
c_old = c_new;
}
#pragma omp section
for (int i=0; i<1000; i++) {
d_new = fun(d_old);
d_old = d_new;
}
}
There is also another simplification:
int value[4];
#pragma omp parallel for
for (int abcd = 0; abcd < 4; abcd++) {
for (int i=0; i<1000; i++) {
value[abcd] = fun(value[abcd]);
}
}
In either case, you might want to consider adding padding between the values to avoid false sharing if fun executes rather quickly.
This pretty straight forward, as #kbr mentioned in the comments each of the calculation a,b,c and d are independent, so you can separate them to different threads and pass the corresponding value as parameter. The sample code looks like this.
#include<stdio.h>
#include <pthread.h>
void *thread_func(int *i)
{
for (int j=0; j<1000; j++)
{
//Instead of increment u can call whichever function you want here.
(*i)++;
}
}
int main()
{
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
pthread_t thread[4];
pthread_create(&thread[0],0,thread_func,&a_old);
pthread_create(&thread[1],0,thread_func,&b_old);
pthread_create(&thread[2],0,thread_func,&c_old);
pthread_create(&thread[3],0,thread_func,&d_old);
pthread_join(&thread[0],NULL);
pthread_join(&thread[1],NULL);
pthread_join(&thread[2],NULL);
pthread_join(&thread[3],NULL);
printf("a_old %d",a_old);
printf("b_old %d",b_old);
printf("c_old %d",c_old);
printf("d_old %d",d_old);
}

OMP 2.0 Nested For Loops

As I'm unable to use omp tasks (using visual studio 2015) I'm trying to find a workaround for a nested loop task. The code is as follows:
#pragma omp parallel
{
for (i = 0; i < largeNum; i++)
{
#pragma omp single
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
}
#pragma omp for
for (n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &resulsts[n])
}
}
#pragma omp barrier
}
I want all my threads to execute largeNum times, but wait for the memset to be set to zero, and then i want the largeFunc be performed by each thread. There are no data dependencies that I have found.
I've got what the omp directives all jumbled in my head at this point. Does this solution work? Is there a better way to do without tasks?
Thanks!
What about just this code?
#pragma omp parallel private( i, n )
for ( i = 0; i < largeNum; i++ ) {
#pragma omp for
for ( n = 0; n < num; n++ ) {
results[n] = 0;
largeFunc( param[n], &results[n] );
}
}
As far as I understand your problem, the intialisation part should be taken care of without the need of the single directive, provided the actual type of results supports the assignment to 0. Moreover, your initial code was lacking of the private( i ) declaration. Finally, the barrier shouldn't be needed .
Why do you want all your threads to execute largeNUM ? do you then depend on index i inside your largeFunc in someway if yes
#pragma omp parallel for
for (int i = 0; i < largeNum; i++)
{
#pragma omp single
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
}
#pragma omp barrier
// #pragma omp for -- this is not needed since it has to be coarse on the outermost level. However if the below function does not have anything to do with the outer loop then see the next example
for (n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &resulsts[n])
}
}
}
If you do not depend on i then
for (i = 0; i < largeNum; i++)
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
#pragma omp parallel for
for (int n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &resulsts[n])
}
}
However I feel you want the first one. In general you parallelise on the outermost loop. Placing pragmas in the innerloop will slow your code down due to overheads if there is not enough work to be done.

OpenMP Array Conversion

I need to perform the array operation within the nested loop. The function dosomething() requires it in this fashion. How can I do this correctly?
#pragma omp parallel
{
#pragma omp for ordered
for (int i = 1; i < iter; i++)
{
#pragma omp ordered
for (int j = 0; j < rows; j++)
{
info2[j][0] = info[j];
}
dosomething(info2);
}
}
This is my problem: without the ordered directives, the array info2 is incorrect every time. However, using the ordered directives takes a fair amount more time than with everything in serial. What is a better way of doing this?

Longest Common Subsequence with openMP

I'm writing a parallel version of the Longest Common Subsequence algorithm using openMP.
The sequential version is the following (and it works correctly):
// Preparing first row and first column with zeros
for(j=0; j < (len2+1); j++)
score[0][j] = 0;
for(i=0; i < (len1+1); i++)
score[i][0] = 0;
// Calculating scores
for(i=1; i < (len1+1); i++) {
for(j=1; j < (len2+1) ;j++) {
if (seq1[i-1] == seq2[j-1]) {
score[i][j] = score[i-1][j-1] + 1;
}
else {
score[i][j] = max(score[i-1][j], score[i][j-1]);
}
}
}
The critical part is filling up the score matrix and this is the part I'm trying to mostly parallelize.
One way to do it (which I chose) is: filling up the matrix by anti diagonals, so left, top and top-left dependecies are always satisfied. In a nutshell, I keep track of the diagonal (third loop, variable i below) and threads fill up that diagonal in parallel.
For this purpose, I've written this code:
void parallelCalculateLCS(int len1, int len2, char *seq1, char *seq2) {
int score[len1 + 1][len2 + 1];
int i, j, k, iam;
char *lcs = NULL;
for(i=0;i<len1+1;i++)
for(j=0;j<len2+1;j++)
score[i][j] = -1;
#pragma omp parallel default(shared) private(iam)
{
iam = omp_get_thread_num();
// Preparing first row and first column with zeros
#pragma omp for
for(j=0; j < (len2+1); j++)
score[0][j] = iam;
#pragma omp for
for(i=0; i < (len1+1); i++)
score[i][0] = iam;
// Calculating scores
for(i=1; i < (len1+1); i++) {
k=i;
#pragma omp for
for(j=1; j <= i; j++) {
if (seq1[k-1] == seq2[j-1]) {
// score[k][j] = score[k-1][j-1] + 1;
score[k][j] = iam;
}
else {
// score[k][j] = max(score[k-1][j], score[k][j-1]);
score[k][j] = iam;
}
#pragma omp atomic
k--;
}
}
}
}
The first two loops (first row and column) work correctly and threads fill up cells in a balanced way.
When it comes to fill up the matrix (diagonally), nothing works well. I tried to debug it, but it seems that threads act and write things randomly.
I can't figure out what's going wrong, since in the first two loops there were no problems at all.
Any idea?
P.S. I know that accessing matrix in a diagonal way is very cache-unfriendly and threads could be unbalanced, but I only need it to work by now.
P.S. #2 I don't know if it could be useful, but my CPU has up to 8 threads.
#pragma omp atomic means that the processors will perform the operation one at a time. You are looking for #pragma omp for private(k) : the processors will no longer share the same value. Bye, Francis
The following nested for loop
#pragma omp for
for(j=1; j <= i; j++)
will be executed in parallel, each thread with a different value of j in no specific order.
As nothing is specified in the omp for section, k will be shared by default between all threads. So depending on the order of the threads, k will be decremented at an unknown time (even with the omp atomic). So for a fixed j, the value of k might change during the execution of the body of the for loop (between the if clauses, ...).

Resources