Optimizing parallel loops with OpenMP using C

EDIT: changed the code and phrasing to make my doubt more explicit
I've been struggling to parallelize a loop in C using OpenMP for quite a while and would like directions on how I should tackle this challenge.
The loop consists of the following (in case you wish to know, this loop is the main loop of a simulated annealing algorithm):
for (attempt = 0; attempt < SATISFIED; attempt++) {
    i = (rand() % (len - 1)) + 1;
    j = i + (rand() % (len - i));
    if (...) {
        ...
        //Update global static variables:
        if (dst < best_distance)
            set_best(dst, path);
        //Stop this attempt:
        attempt = -1;
    }
    //Decrease the temperature:
    temp = change_temp(temp);
}
The problem with this loop is that the number of iterations cannot be computed from its condition, so I came up with a different way to write it in order to be able to use OpenMP:
while (keepGoing) {
    keepGoing = 0;
    #pragma omp parallel for default(none) shared(len, best_distance, best_path, distances, avg_distance, path) private(i, j, seed, swp_dst) lastprivate(dst, temp, keepGoing) firstprivate(dst, temp, abort, keepGoing)
    for (attempt = 0; attempt < SATISFIED; attempt++) {
        #pragma omp flush (abort)
        if (!abort) {
            seed = omp_get_thread_num();
            i = (rand_r(&seed) % (len - 1)) + 1;
            j = i + (rand_r(&seed) % (len - i));
            //Update progress:
            #pragma omp critical
            {
                if (...) {
                    ...
                    //Update global static variables:
                    if (dst < best_distance)
                        set_best(dst, path);
                    //Stop this attempt:
                    keepGoing = 1;
                    abort = 1;
                    #pragma omp flush (abort)
                    #pragma omp flush (keepGoing)
                }
            }
            //Decrease the temperature:
            temp = change_temp(temp);
        }
    }
}
However, this solution gives a different output than the sequential version I wrote before, for reasons I don't understand... Are the OpenMP directives well placed? Or should I use them differently? Thanks in advance for any answer.
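As an aside, one pattern worth comparing against the code above (a minimal, independent sketch, not the program itself): with rand_r, each thread should be seeded once before the loop rather than having its seed reset on every iteration, since rand_r advances the seed it is given. SATISFIED and the loop body below are stand-ins.
#define _POSIX_C_SOURCE 200112L
#include <omp.h>
#include <stdlib.h>
#include <time.h>

void sketch(int len)
{
    #pragma omp parallel
    {
        /* Seed once per thread, not once per iteration: rand_r updates
           the seed, so resetting it every pass would make each iteration
           draw the same numbers again. */
        unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (int attempt = 0; attempt < 1000; attempt++) {
            int i = (rand_r(&seed) % (len - 1)) + 1;
            int j = i + (rand_r(&seed) % (len - i));
            (void)i; (void)j; /* ... real work would go here ... */
        }
    }
}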

Related

How can I parallelize two for statements equally between threads using OpenMP?

Let's say I have the following code:
#pragma omp parallel for
for (i = 0; i < array.size; i++) {
    int temp = array[i];
    for (p = 0; p < array2.size; p++) {
        array2[p] = array2[p] + temp;
    }
}
How can I divide array2.size between the threads spawned by the #pragma omp parallel for in the first line? From what I understand, when I use #pragma omp parallel for, several threads are spawned in such a way that each thread gets a part of array.size, so i will never be the same between threads. But in this case I also want those same threads to each get a different part of array2.size (their p will also never be the same between them), so that I don't have all the threads doing the same calculation in the second for loop.
I've tried the collapse clause, but it seems that it only applies to perfectly nested loops, since I couldn't get the result I wanted.
Any help is appreciated! Thanks in advance
The problem with your code is that multiple threads will try to modify array2 at the same time (race condition). This can easily be avoided by reordering the loops. If array2.size doesn't provide enough parallelism, you may apply the collapse clause, as the loops are now in canonical form.
#pragma omp parallel for
for (p = 0; p < array2.size; p++) {
    for (i = 0; i < array.size; i++) {
        array2[p] += array[i];
    }
}
You shouldn't expect too much of this though as the ratio between loads/stores and computation is very bad. This is without a doubt memory-bound and not compute-bound.
EDIT: If this is really your problem and not just a minimal example, I would also try the following:
double sum = 0.;
#pragma omp parallel
{
    #pragma omp for reduction(+: sum)
    for (i = 0; i < array.size; i++) {
        sum += array[i];
    }
    #pragma omp for
    for (p = 0; p < array2.size; p++) {
        array2[p] += sum;
    }
}

Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

I've implemented a version of the Travelling Salesman with xmmintrin.h SSE instructions and received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slowdown. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for (int i = 0; i < numNodes; i++) {
    minimumDistance = DBL_MAX;
    minimumDistanceNode;
    for (int j = 0; j < numNodes; j++) {
        // find distance between 'currentNode' to j-th node
        // ...
        if (jthNodeDistance < minimumDistance) {
            minimumDistance = jthNodeDistance;
            minimumDistanceNode = jthNode;
        }
    }
    currentNode = minimumDistanceNode;
}
And here is my implementation, still semi-pseudocode as I've brushed over some parts that I don't think have an impact on performance. I believe the issues with my code can be found in the following snippet. If you just omit the #pragma lines, it is pretty much identical to the SSE-only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < totalNum; i++) {
            minimum = DBL_MAX;
            __m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
            __m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                float localMinimum = DBL_MAX;
                float localMinimumNode;
                #pragma omp for
                for (int j = 0; j < loopEnd; j += 4) {
                    // a number of SSE vector calculations to find distance
                    // between the current node and the four nodes we're looking
                    // at in this iteration of the loop:
                    __m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
                    __m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
                    __m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
                    __m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
                    __m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
                    float temp[unroll];
                    _mm_store_ps(&temp[0], addXY_0);
                    // skipping stuff here that is about getting the minimum
                    // distance and its equivalent node; don't think it's
                    // massively relevant, but each thread will have its own
                    // localMinimum
                    // localMinimumNode
                }
                // updating the global minimumNode in a thread-safe way
                #pragma omp critical (update_minimum)
                {
                    if (localMinimum < minimum) {
                        minimum = localMinimum;
                        minimumNode = localMinimumNode;
                    }
                }
            }
            // within the 'omp single'
            ThisPt = minimumNode;
        }
    }
}
So my logic is:
omp single for the top-level for(int i) loop, as I only want one thread dedicated to it.
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) loop, as I want all cores working on this part of the code at the same time.
omp critical at the end of the full for(int j) loop, as I want to update the current node thread-safely.
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
First:
omp single for the top-level for(int i) loop, as I only want one thread dedicated to it.
In your code you have the following:
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < totalNum; i++)
        {
            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                //....
            }
            // within the 'omp single'
            ThisPt = minimumNode;
        }
    }
}
The #pragma omp parallel creates a team of threads, but then only one thread executes the work (i.e., the #pragma omp single block) while the other threads don't do anything. You can simplify this to:
for (int i = 1; i < totalNum; i++)
{
    #pragma omp parallel num_threads(omp_get_max_threads())
    {
        //....
    }
    ThisPt = minimumNode;
}
The outer loop is still executed by only one thread, exactly as before.
Second:
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) loop, as I want all cores working on this part of the code at the same time.
The problem is that omp_get_max_threads() might return the number of logical cores rather than physical cores, and some codes perform worse with hyper-threading. So I would first test with different numbers of threads, starting from 2, then 4, and so on, until you find a count at which the code stops scaling.
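For instance, a small timing harness along these lines makes it easy to find where scaling stops. This is only a sketch: run_search() is a hypothetical wrapper around the parallel inner loop, not a function from the original post.
#include <omp.h>
#include <stdio.h>

void run_search(void);   /* hypothetical: the parallel kernel under test */

void scan_thread_counts(void)
{
    /* Double the thread count until the runtime stops improving. */
    for (int t = 1; t <= omp_get_max_threads(); t *= 2) {
        omp_set_num_threads(t);
        double t0 = omp_get_wtime();
        run_search();
        printf("%2d threads: %.3f s\n", t, omp_get_wtime() - t0);
    }
}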
omp critical at the end of the full for(int j) loop, as I want to update the current node thread-safely.
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
    if (localMinimum < minimum) {
        minimum = localMinimum;
        minimumNode = localMinimumNode;
    }
}
This can be replaced by creating an array in which each thread saves its local minimum in a slot reserved for that thread; outside the parallel region, the initial thread then extracts the minimum and minimumNode:
int total_threads = /* ... */;
float localMinimum[total_threads];
float localMinimumNode[total_threads];
// VLAs cannot have initializers, so fill in the sentinels explicitly:
for (int i = 0; i < total_threads; i++)
    localMinimum[i] = FLT_MAX;
#pragma omp parallel num_threads(total_threads)
{
    /* ... each thread writes only its own slot,
       localMinimum[omp_get_thread_num()] ... */
}
for (int i = 0; i < total_threads; i++) {
    if (localMinimum[i] < minimum) {
        minimum = localMinimum[i];
        minimumNode = localMinimumNode[i];
    }
}
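If OpenMP 4.0 or later is available, the same min-plus-index pattern can also be expressed with a user-declared reduction instead of the manual array. A sketch, assuming a hypothetical distanceTo() helper in place of the SSE distance code:
#include <float.h>
#include <omp.h>

typedef struct { float dist; int node; } MinPair;

/* Combine two partial results, keeping the smaller distance. */
#pragma omp declare reduction(minpair : MinPair : \
        omp_out = (omp_in.dist < omp_out.dist ? omp_in : omp_out)) \
    initializer(omp_priv = (MinPair){ FLT_MAX, -1 })

float distanceTo(int j);   /* hypothetical stand-in for the SSE code */

MinPair find_nearest(int numNodes)
{
    MinPair best = { FLT_MAX, -1 };
    #pragma omp parallel for reduction(minpair : best)
    for (int j = 0; j < numNodes; j++) {
        float d = distanceTo(j);
        if (d < best.dist) {
            best.dist = d;
            best.node = j;
        }
    }
    return best;
}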
Finally, after those changes are done, check whether it is possible to replace this parallelization with the following:
#pragma omp parallel for
for (int i = 1; i < totalNum; i++)
{
    ...
}

How to synchronize 3 nested loops in OpenMP?

I am writing a program that will match up one block (a group of 4 double numbers which are within a certain absolute value of each other) with another.
Essentially, I will call the function in main.
The matrix has 4399 rows and 500 columns. I am trying to use OpenMP to speed up the task, yet my code seems to have a race condition within the innermost loop (where the actual creation of a block happens, create_Block(rrr[k], i);).
It is ok to ignore all the function details, as they work well in the serial version. The only focus here is the OpenMP directives.
int main(void) {
    readKey("keys.txt");
    double** jz = readMatrix("data.txt");
    int j = 0;
    int i = 0;
    int k = 0;
    #pragma omp parallel for firstprivate(i) shared(Big_Block, NUM_OF_BLOCK, SIZE_OF_COLLECTION, b)
    for (i = 0; i < 50; i++) {
        printf("THIS IS COLUMN %d\n", i);
        double* c = readCol(jz, i, 4400);
        #pragma omp parallel for firstprivate(j) shared(i, Big_Block, NUM_OF_BLOCK, SIZE_OF_COLLECTION, b)
        for (j = 0; j < 4400; j++) {
            // printf("This is fixed row %d from column %d !!!!!!!!!!\n", j, i);
            int* one_collection = collection(c, j, 4400);
            // MODIFY THE DYNAMIC ALLOCATION OF SPACES (SIZE_OF_COMBINATION) IN combNonRec() function.
            if (get_combination_size(SIZE_OF_COLLECTION, M) >= 4) {
                // GET THE 2D-ARRAY OF COMBINATIONS
                int** rrr = combNonRec(one_collection, SIZE_OF_COLLECTION, M);
                #pragma omp parallel for firstprivate(k) shared(i, j, Big_Block, NUM_OF_BLOCK, SIZE_OF_COLLECTION, b)
                for (k = 0; k < get_combination_size(SIZE_OF_COLLECTION, M); k++) {
                    create_Block(rrr[k], i); // ACTUAL CREATION OF BLOCK !!!!!!!
                    printf("This is block %d \n", NUM_OF_BLOCK);
                    add_To_Block_Collection();
                }
                free(rrr);
            }
            free(one_collection);
        }
        // OpenMP for j
        free(c);
    }
    // OpenMP for i
    collision();
}
Here is the parallel version's result: non-deterministic, whereas the serial version consistently produces 400 blocks.
Big_Block, NUM_OF_BLOCK, and SIZE_OF_COLLECTION are global variables.
Did I do anything wrong in the directive declarations? What might have caused this problem?
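One thing worth checking, as an illustration rather than a diagnosis: if create_Block or add_To_Block_Collection increments the shared counter NUM_OF_BLOCK without synchronization, the final count becomes non-deterministic, while an atomic update keeps it exact. A minimal, self-contained sketch of the pattern (the loop body here is a placeholder, not the real block-creation code):
#include <omp.h>
#include <stdio.h>

int NUM_OF_BLOCK = 0;   /* shared global counter, as in the question */

int main(void)
{
    #pragma omp parallel for
    for (int k = 0; k < 400; k++) {
        /* ... block creation work would happen here ... */
        #pragma omp atomic
        NUM_OF_BLOCK++;   /* synchronized: the final count is always 400 */
    }
    printf("blocks: %d\n", NUM_OF_BLOCK);
    return 0;
}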

OMP 2.0 Nested For Loops

As I'm unable to use OpenMP tasks (using Visual Studio 2015), I'm trying to find a workaround for a nested-loop task. The code is as follows:
#pragma omp parallel
{
    for (i = 0; i < largeNum; i++)
    {
        #pragma omp single
        {
            //Some code to be run by a single thread
            memset(results, 0, num * sizeof(results[0]));
        }
        #pragma omp for
        for (n = 0; n < num; n++) {
            //Call to my function
            largeFunc(params[n], &results[n]);
        }
    }
    #pragma omp barrier
}
I want all my threads to execute the loop largeNum times, but to wait for the memset to finish zeroing results, and then I want largeFunc to be performed by each thread. There are no data dependencies that I have found.
I've got the omp directives all jumbled in my head at this point. Does this solution work? Is there a better way to do this without tasks?
Thanks!
What about just this code?
#pragma omp parallel private(i, n)
for (i = 0; i < largeNum; i++) {
    #pragma omp for
    for (n = 0; n < num; n++) {
        results[n] = 0;
        largeFunc(params[n], &results[n]);
    }
}
As far as I understand your problem, the initialisation part should be taken care of without the need for the single directive, provided the actual type of results supports assignment to 0. Moreover, your initial code was lacking the private(i) declaration. Finally, the barrier shouldn't be needed, since the #pragma omp for construct already ends with an implied barrier.
Why do you want all your threads to execute the largeNum loop? Do you depend on the index i inside largeFunc in some way? If yes:
#pragma omp parallel for
for (int i = 0; i < largeNum; i++)
{
    #pragma omp single
    {
        //Some code to be run by a single thread
        memset(results, 0, num * sizeof(results[0]));
    }
    #pragma omp barrier
    // #pragma omp for -- this is not needed since it has to be coarse on the
    // outermost level. However, if the function below does not have anything
    // to do with the outer loop, then see the next example
    for (n = 0; n < num; n++) {
        //Call to my function
        largeFunc(params[n], &results[n]);
    }
}
If you do not depend on i, then:
for (i = 0; i < largeNum; i++)
{
    //Some code to be run by a single thread
    memset(results, 0, num * sizeof(results[0]));
    #pragma omp parallel for
    for (int n = 0; n < num; n++) {
        //Call to my function
        largeFunc(params[n], &results[n]);
    }
}
However, I feel you want the first one. In general, you parallelise on the outermost loop: placing pragmas in the inner loop will slow your code down due to overheads if there is not enough work to be done.

using pragma in parallel for construct

In my VS2010 C code I am successfully using the pragma directive here:
void doSomething(void)
{
    n = doSomethingElse();
    j = doOnceMore();
    k = n + j;
}

#pragma omp parallel for
for (i = 0; i < 5; ++i)
{
    doSomething();
}
But I cannot get it to work if I move the work of "doSomething()" inline:
#pragma omp parallel for
for (int i = 0; i < 5; ++i)
{
    n = doSomethingElse();
    j = doOnceMore();
    k = n + j;
}
I always assumed that the pragma directive would take the stuff inside the brackets and assign it a unique thread. Am I dead wrong about that, or is there some other omp syntax I should use?
n, j, k are shared between threads by default, hence it does not work: every thread is currently writing to n, j, k at the same time.
Whether they should be private or shared depends on what you want to do, though. If they are local to one loop pass, you can declare them as private and it should work fine (the loop counter, here i, is private by default).
#pragma omp parallel for private(n, j, k)
for (int i = 0; i < 5; ++i)
{
    n = doSomethingElse();
    j = doOnceMore();
    k = n + j;
}
Since OpenMP cannot guess what purpose your variables serve, it is your job to tell the pragma how to handle them. You can find more information about clauses and variables in the OpenMP documentation; there is also a very good talk on the OpenMP webpage about data structures and constructing proper parallel regions.
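For illustration, here is a compilable sketch of the same loop shape with explicit data-sharing clauses; the arithmetic bodies are stand-ins for doSomethingElse() and doOnceMore(), and the reduction on total is added to show how per-thread results are usually combined:
#include <stdio.h>

int main(void)
{
    int n, j, k;
    int total = 0;
    /* private: each thread gets its own n, j, k for its loop passes;
       reduction: per-thread partial sums of total are combined at the end. */
    #pragma omp parallel for private(n, j, k) reduction(+:total)
    for (int i = 0; i < 5; ++i) {
        n = i + 1;      /* stand-in for doSomethingElse() */
        j = 2 * i;      /* stand-in for doOnceMore()      */
        k = n + j;
        total += k;
    }
    printf("total = %d\n", total);
    return 0;
}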
