OpenMP Array Conversion - arrays

I need to perform the array operation within the nested loop. The function dosomething() requires it in this fashion. How can I do this correctly?
#pragma omp parallel
{
#pragma omp for ordered
for (int i = 1; i < iter; i++)
{
#pragma omp ordered
for (int j = 0; j < rows; j++)
{
info2[j][0] = info[j];
}
dosomething(info2);
}
}
This is my problem: without the ordered directives, the array info2 is incorrect every time. However, with the ordered directives the code takes considerably longer than running everything serially. What is a better way of doing this?
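One possible reading of the problem is that info2 is shared by all threads, so one iteration can overwrite it while another iteration's dosomething() call is still using it. A minimal sketch, assuming each iteration can work on its own copy of info2. ROWS, ITER, run_all, and the body of dosomething() are hypothetical stand-ins, not the original code:

```c
#define ROWS 4   /* hypothetical number of rows */
#define ITER 8   /* hypothetical iteration count */

/* Hypothetical stand-in for dosomething(): sums the first column and then
   overwrites it, which would corrupt a copy shared with other threads. */
static long dosomething(int info2[ROWS][1]) {
    long sum = 0;
    for (int j = 0; j < ROWS; j++) {
        sum += info2[j][0];
        info2[j][0] = -1;
    }
    return sum;
}

/* Runs the loop from the question without any ordered directive. */
long run_all(const int info[ROWS]) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i < ITER; i++) {
        int info2[ROWS][1];              /* declared inside the loop body,
                                            so each iteration has its own copy */
        for (int j = 0; j < ROWS; j++)
            info2[j][0] = info[j];
        total += dosomething(info2);     /* only ever sees this iteration's copy */
    }
    return total;
}
```

With info = {1, 2, 3, 4}, every one of the 7 iterations contributes 10, so run_all(info) returns 70 regardless of the number of threads. Whether this applies depends on what the real dosomething() does with info2.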

Related

Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

I've implemented a version of the Travelling Salesman with xmmintrin.h SSE instructions, received a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slow down. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for(int i = 0; i < numNodes; i++) {
minimumDistance = DBL_MAX;
minimumDistanceNode;
for(int j = 0; j < numNodes; j++) {
// find distance between 'currentNode' to j-th node
// ...
if(jthNodeDistance < minimumDistance) {
minimumDistance = jthNodeDistance;
minimumDistanceNode = jthNode;
}
}
currentNode = minimumDistanceNode;
}
And here is my implementation. It is still semi-pseudocode, since I've brushed over some parts that I don't think affect performance; I think the issue lies somewhere in the following snippet. If you omit the #pragma lines, it is pretty much identical to the SSE-only version of the program, so I've only included the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++) {
minimum = DBL_MAX;
__m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
__m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);
#pragma omp parallel num_threads(omp_get_max_threads())
{
float localMinimum = DBL_MAX;
float localMinimumNode;
#pragma omp for
for (int j = 0; j < loopEnd; j += 4) {
// a number of SSE vector calculations to find distance
// between the current node and the four nodes we're looking
// at in this iteration of the loop:
__m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
__m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
__m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
__m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
__m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);
float temp[unroll];
_mm_store_ps(&temp[0], addXY_0);
// skipping stuff here that is about getting the minimum distance and
// its equivalent node, don't think it's massively relevant but
// each thread will have its own
// localMinimum
// localMinimumNode
}
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
So my logic is:
omp single for the top-level for(int i) for loop, and I only want 1 thread dedicated to this
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) for-loop, as I want all cores working on this part of the code at the same time.
omp critical at the end of the full for(int j) loop, as I want to thread-safely update the current node.
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
Does anything jump out at you as particularly bad in my code, that is
causing this drastic slow-down for OpenMP?
First:
omp single for the top-level for(int i) for loop, and I only want 1
thread dedicated to this
In your code you have the following:
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
The #pragma omp parallel creates a team of threads, but then only one thread executes the enclosed block (because of #pragma omp single) while the other threads do nothing. You can simplify this to:
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
ThisPt = minimumNode;
}
The outer loop is still executed by only one thread.
Second :
omp parallel num_threads(omp_get_max_threads()) for the inner for(int
j) for-loop, as I want all cores working on this part of the code at
the same time.
The problem is that omp_get_max_threads() might return the number of logical cores rather than physical cores, and some codes perform worse with hyper-threading. So I would first test with different numbers of threads, starting from 2, then 4, and so on, until you find the number at which the code stops scaling.
omp critical at the end of the full for(int j) loop, as I want to
thread-safely update the current node.
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
if (localMinimum < minimum) {
minimum = localMinimum;
minimumNode = localMinimumNode;
}
}
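This can be replaced by an array in which each thread saves its local minimum in a slot reserved for that thread; outside the parallel region, the initial thread then extracts minimum and minimumNode:
int total_threads = /..;
float localMinimum[total_threads];
float localMinimumNode[total_threads];
// a variable-length array cannot take an initializer, so fill every slot explicitly
// (FLT_MAX comes from <float.h>; DBL_MAX does not fit in a float)
for(int t = 0; t < total_threads; t++){
localMinimum[t] = FLT_MAX;
localMinimumNode[t] = -1;
}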
#pragma omp parallel num_threads(total_threads)
{
/...
}
for(int i = 0; i < total_threads; i++){
if (localMinimum[i] < minimum) {
minimum = localMinimum[i];
minimumNode = localMinimumNode[i];
}
}
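The per-thread-slot idea above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the original program: nearest_node is a hypothetical helper, it uses plain scalar arithmetic instead of the SSE intrinsics, and the serial fallbacks exist only so that the sketch also builds without OpenMP.

```c
#include <float.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Hypothetical helper: index of the node closest to `current`,
   or -1 if there is no other node. */
int nearest_node(const float *xs, const float *ys, int n, int current) {
    int total_threads = omp_get_max_threads();
    float slotMin[total_threads];   /* one slot per thread */
    int slotNode[total_threads];
    for (int t = 0; t < total_threads; t++) {
        slotMin[t] = FLT_MAX;
        slotNode[t] = -1;
    }

    #pragma omp parallel num_threads(total_threads)
    {
        int id = omp_get_thread_num();
        float localMin = FLT_MAX;
        int localNode = -1;

        #pragma omp for
        for (int j = 0; j < n; j++) {
            if (j == current) continue;
            float dx = xs[current] - xs[j];
            float dy = ys[current] - ys[j];
            float d = dx * dx + dy * dy;    /* squared distance is enough */
            if (d < localMin) { localMin = d; localNode = j; }
        }

        /* Each thread writes only its own slot, so no critical section. */
        slotMin[id] = localMin;
        slotNode[id] = localNode;
    }

    /* Serial merge of the per-thread results. */
    float minimum = FLT_MAX;
    int minimumNode = -1;
    for (int t = 0; t < total_threads; t++) {
        if (slotMin[t] < minimum) {
            minimum = slotMin[t];
            minimumNode = slotNode[t];
        }
    }
    return minimumNode;
}
```

For the points (0,0), (10,0), (1,1), (5,5), the node nearest to index 0 is index 2. If fewer threads are granted than requested, the unused slots keep FLT_MAX and are ignored by the merge.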
Finally, after those changes are done, you can check whether it is possible to replace this parallelization with the following:
#pragma omp parallel for
for (int i = 1; i < totalNum; i++)
{
...
}

For inside for - how to do inner for parallel without spending time on creating threads

I'm new to OpenMP and I'm facing a situation like this:
int someArray[ARRAY_SIZE];
//outer loop
for(int i = 0; i < 100; ++i) {
//inner loop
for(int j = 0; j < ARRAY_SIZE; ++j) {
//calculations in someArray (every cell can be calculated separately)
}
//some code that needs to be run by only one thread - for example sorting someArray
}
I want to make the inner loop parallel, but the idea that I tried (code below) is not effective (a single thread can do things faster than multiple threads). I think that creating multiple threads over and over wastes a lot of time here.
My bad solution:
int someArray[ARRAY_SIZE];
//outer loop
for(int i = 0; i < 100; ++i) {
#pragma omp parallel num_threads(THREADS_NUMBER) shared(someArray)
{
//inner loop
#pragma omp for
for(int j = 0; j < ARRAY_SIZE; ++j) {
//calculations in someArray (every cell can be calculated separately)
}
}
//some code that needs to be run by only one thread - for example sorting someArray
}
Do you have any idea how to optimise this task?
When you have nested for loops, you almost always want to parallelize the outer loop. In your case:
#pragma omp parallel for
for(int i = 0; i < 100; ++i) {
for(int j = 0; j < ARRAY_SIZE; ++j) {
//calculations in someArray (every cell can be calculated separately)
}
//some code that needs to be run by only one thread - for example sorting someArray
}
If you have 4 CPUs available, this splits the 100 iterations into 25 per CPU. This is much more efficient than your code, which splits ARRAY_SIZE across the CPUs in each of the 100 iterations (you thus have 100x the overhead).
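If the serial step really has to run after every pass of the inner loop, another option is to keep a single parallel region alive across all outer iterations, so the team of threads is created only once. A minimal sketch, assuming the per-cell work is independent. ARRAY_SIZE, OUTER, run_passes, the cell formula, and the checksum standing in for the serial step are all placeholders:

```c
#define ARRAY_SIZE 1000   /* placeholder size */
#define OUTER 100         /* placeholder outer iteration count */

/* Returns the checksum computed by the last serial step. */
long run_passes(int *someArray) {
    long checksum = 0;

    #pragma omp parallel            /* team is created once, outside the loop */
    {
        /* Every thread runs the outer loop with its own private i. */
        for (int i = 0; i < OUTER; ++i) {
            #pragma omp for
            for (int j = 0; j < ARRAY_SIZE; ++j)
                someArray[j] = i + j;    /* independent per-cell work */
            /* Implicit barrier: all cells are written before the serial step. */

            #pragma omp single
            {
                /* Serial step, executed by exactly one thread. */
                long s = 0;
                for (int j = 0; j < ARRAY_SIZE; ++j)
                    s += someArray[j];
                checksum = s;
            }
            /* Implicit barrier after single: the array may be overwritten again. */
        }
    }
    return checksum;
}
```

The implicit barriers at the end of the omp for and omp single constructs keep the parallel pass and the serial step from overlapping. Whether this beats the serial version still depends on how much work each pass contains.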

OMP 2.0 Nested For Loops

As I'm unable to use omp tasks (using Visual Studio 2015), I'm trying to find a workaround for a nested loop task. The code is as follows:
#pragma omp parallel
{
for (i = 0; i < largeNum; i++)
{
#pragma omp single
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
}
#pragma omp for
for (n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &results[n]);
}
}
#pragma omp barrier
}
I want all my threads to execute largeNum times, but to wait until the memset has zeroed the results, and then I want largeFunc to be performed by each thread. There are no data dependencies that I have found.
I've got the omp directives all jumbled in my head at this point. Does this solution work? Is there a better way to do it without tasks?
Thanks!
What about just this code?
#pragma omp parallel private( i, n )
for ( i = 0; i < largeNum; i++ ) {
#pragma omp for
for ( n = 0; n < num; n++ ) {
results[n] = 0;
largeFunc( params[n], &results[n] );
}
}
As far as I understand your problem, the initialisation part should be taken care of without the need for the single directive, provided the actual type of results supports assignment of 0. Moreover, your initial code was lacking the private( i ) declaration. Finally, the barrier shouldn't be needed.
Why do you want all your threads to execute largeNum times? Do you depend on the index i inside largeFunc in some way? If yes:
#pragma omp parallel for
for (int i = 0; i < largeNum; i++)
{
#pragma omp single
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
}
#pragma omp barrier
// #pragma omp for -- this is not needed since it has to be coarse on the outermost level. However if the below function does not have anything to do with the outer loop then see the next example
for (int n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &results[n]);
}
}
If you do not depend on i then
for (i = 0; i < largeNum; i++)
{
//Some code to be run by a single thread
memset(results, 0, num * sizeof(results[0]));
#pragma omp parallel for
for (int n = 0; n < num; n++) {
//Call to my function
largeFunc(params[n], &results[n]);
}
}
However, I feel you want the first one. In general, you parallelise the outermost loop. Placing pragmas on the inner loop will slow your code down due to overhead if there is not enough work to be done.

Longest Common Subsequence with openMP

I'm writing a parallel version of the Longest Common Subsequence algorithm using OpenMP.
The sequential version is the following (and it works correctly):
// Preparing first row and first column with zeros
for(j=0; j < (len2+1); j++)
score[0][j] = 0;
for(i=0; i < (len1+1); i++)
score[i][0] = 0;
// Calculating scores
for(i=1; i < (len1+1); i++) {
for(j=1; j < (len2+1) ;j++) {
if (seq1[i-1] == seq2[j-1]) {
score[i][j] = score[i-1][j-1] + 1;
}
else {
score[i][j] = max(score[i-1][j], score[i][j-1]);
}
}
}
The critical part is filling up the score matrix and this is the part I'm trying to mostly parallelize.
One way to do it (which I chose) is to fill up the matrix by anti-diagonals, so that the left, top, and top-left dependencies are always satisfied. In a nutshell, I keep track of the diagonal (third loop, variable i below) and threads fill up that diagonal in parallel.
For this purpose, I've written this code:
void parallelCalculateLCS(int len1, int len2, char *seq1, char *seq2) {
int score[len1 + 1][len2 + 1];
int i, j, k, iam;
char *lcs = NULL;
for(i=0;i<len1+1;i++)
for(j=0;j<len2+1;j++)
score[i][j] = -1;
#pragma omp parallel default(shared) private(iam)
{
iam = omp_get_thread_num();
// Preparing first row and first column with zeros
#pragma omp for
for(j=0; j < (len2+1); j++)
score[0][j] = iam;
#pragma omp for
for(i=0; i < (len1+1); i++)
score[i][0] = iam;
// Calculating scores
for(i=1; i < (len1+1); i++) {
k=i;
#pragma omp for
for(j=1; j <= i; j++) {
if (seq1[k-1] == seq2[j-1]) {
// score[k][j] = score[k-1][j-1] + 1;
score[k][j] = iam;
}
else {
// score[k][j] = max(score[k-1][j], score[k][j-1]);
score[k][j] = iam;
}
#pragma omp atomic
k--;
}
}
}
}
The first two loops (first row and column) work correctly and threads fill up cells in a balanced way.
When it comes to filling up the matrix (diagonally), nothing works well. I tried to debug it, but it seems that threads act and write things randomly.
I can't figure out what's going wrong, since in the first two loops there were no problems at all.
Any idea?
P.S. I know that accessing matrix in a diagonal way is very cache-unfriendly and threads could be unbalanced, but I only need it to work by now.
P.S. #2 I don't know if it could be useful, but my CPU has up to 8 threads.
#pragma omp atomic means that the processors will perform the operation one at a time. You are looking for #pragma omp for private(k): the processors will then no longer share the same value. Bye, Francis
The following nested for loop
#pragma omp for
for(j=1; j <= i; j++)
will be executed in parallel, each thread with a different value of j in no specific order.
As nothing is specified in the omp for section, k will be shared by default between all threads. So depending on the order of the threads, k will be decremented at an unknown time (even with the omp atomic). So for a fixed j, the value of k might change during the execution of the body of the for loop (between the if clauses, ...).
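One way to avoid the shared counter altogether is to derive both indices from the diagonal number and the loop variable, so that no thread ever modifies k. A minimal sketch, assuming a heap-allocated, row-major score matrix. lcs_length_wavefront is a hypothetical name, and this is not the original poster's function:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: LCS length computed by anti-diagonals
   (cells with i + j == d). Returns -1 on allocation failure. */
int lcs_length_wavefront(const char *seq1, const char *seq2) {
    int len1 = (int)strlen(seq1);
    int len2 = (int)strlen(seq2);
    int cols = len2 + 1;
    /* calloc zeroes the first row and column for us. */
    int *score = calloc((size_t)(len1 + 1) * (size_t)cols, sizeof *score);
    if (score == NULL) return -1;

    #pragma omp parallel
    {
        /* d, lo and hi are declared inside the region, so they are private. */
        for (int d = 2; d <= len1 + len2; d++) {
            int lo = (d - len2 > 1) ? d - len2 : 1;   /* keeps j <= len2 */
            int hi = (d - 1 < len1) ? d - 1 : len1;   /* keeps j >= 1 */

            #pragma omp for
            for (int i = lo; i <= hi; i++) {
                int j = d - i;      /* column follows from the diagonal */
                if (seq1[i - 1] == seq2[j - 1]) {
                    score[i * cols + j] = score[(i - 1) * cols + (j - 1)] + 1;
                } else {
                    int up = score[(i - 1) * cols + j];
                    int left = score[i * cols + (j - 1)];
                    score[i * cols + j] = (up > left) ? up : left;
                }
            }
            /* Implicit barrier: diagonal d is complete before d + 1 starts. */
        }
    }

    int result = score[len1 * cols + len2];
    free(score);
    return result;
}
```

Each cell on diagonal d reads only from diagonals d - 1 and d - 2, which are complete once the implicit barrier of the previous omp for has been passed. For example, lcs_length_wavefront("AGGTAB", "GXTXAYB") returns 4.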

OpenMP: synchronization inside parallel for

I have a code that reads like this
void h(particles *p) {
#pragma omp parallel for
for (int i = 0; i < maxThreads; ++i) {
int id = omp_get_thread_num();
for (int j = 0; j < dtnum; ++j) {
f( p, id);
if ( j % 50 == 0 ) {
if (id == 0) {
g(p);
}
#pragma omp barrier
}
}
}
}
void f(particles *p, int id) {
for (int i = id * prt_thread; i < (id + 1)*prt_thread; ++i) {
x(p[i]);
}
}
Basically I want to:
1) spawn a given number of threads; each thread will process a chunk of p according to the thread's id
2) each element of p must be processed dtnum times; the processing involves random events
3) every 50 iterations, one thread must perform another operation while the other threads wait
Problem: gcc says warning: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region
what can I do?
It's hard to tell from the very schematic code, but if all you want to do is sync up every so many iterations, it seems easiest to pull the iteration loop out of the parallel omp for loop - which seems clearer anyway - and just do
const int iterblocks=50;
#pragma omp parallel shared(p, dtnum) default(none)
for (int jblock=0; jblock<dtnum/iterblocks; jblock++) {
for (int j=0; j<iterblocks; j++) {
#pragma omp for nowait
for (int i=0; i<prt; i++)
x(p[i]);
}
#pragma omp barrier
#pragma omp single
g(p);
#pragma omp barrier
}
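A self-contained variant of the same structure, with counters in place of the real particle update so that the synchronization pattern can be checked. PRT, DTNUM, SYNC_EVERY, the counter arrays, and the bodies of x() and g() are hypothetical stand-ins:

```c
#include <string.h>

#define PRT 64          /* hypothetical particle count */
#define DTNUM 200       /* hypothetical number of time steps */
#define SYNC_EVERY 50   /* run the serial step every 50 time steps */

static int steps[PRT];  /* how many times each particle was processed */
static int sync_calls;  /* how many times the serial step ran */

/* Stand-ins for the real x() and g(). */
static void x(int *particle_steps) { (*particle_steps)++; }
static void g(void) { sync_calls++; }

void h(void) {
    memset(steps, 0, sizeof steps);
    sync_calls = 0;

    #pragma omp parallel
    {
        /* Every thread walks the time loop with its own private j. */
        for (int j = 1; j <= DTNUM; j++) {
            #pragma omp for
            for (int i = 0; i < PRT; i++)
                x(&steps[i]);        /* each particle is touched by one thread */
            /* Implicit barrier: the whole step is done. */

            if (j % SYNC_EVERY == 0) {
                /* All threads reach this point; one runs g(), the rest wait
                   at the implicit barrier that ends the single construct. */
                #pragma omp single
                g();
            }
        }
    }
}
```

After h() returns, every entry of steps equals DTNUM (200) and sync_calls equals DTNUM / SYNC_EVERY (4). Because the barrier is no longer nested inside a work-sharing loop, the gcc warning from the question does not apply to this layout.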
I think your code is wrong. You said:
each element of p must be processed dtnum times.
But each element of p will be processed maxThreads*dtnum times.
Could you be more explicit about what your code is supposed to do?

Resources