I have a situation in which I need to repeat a specific iteration of a loop multiple times. So, in that specific iteration, I decrement the index by one step so that the next increment of the loop index makes no difference.
This approach, which is the one I have to implement, works for multi-threaded OpenMP code. However, it does not work for OpenACC (for both the multicore and tesla targets). I get the following error:
Floating point exception (core dumped)
Here is the code for both cases:
#include <stdio.h>
#include <omp.h>
#include <unistd.h>
int main() {
int x = 52;
int count = 5;
int i;
omp_set_num_threads(6);
#pragma omp parallel for
for(i=0;i<100;i++) {
if(i == x) {
printf("%d\n", i);
i--;
count--;
if(count == 0)
x = 10000;
}
}
int gpu_count = 0;
count = 5;
x = 52;
#pragma acc parallel loop independent
for(i=0;i<1000000;i++) {
if(i == x) {
#pragma acc atomic
gpu_count++;
i--;
count--;
if(count == 0)
x = 2000000;
}
}
printf("gpu_count: %d\n", gpu_count);
return 0;
}
For OpenMP, I get the correct output:
52
52
52
52
52
But for OpenACC, I get the above-mentioned error.
If I comment out line 35 (the i--; in the OpenACC loop), the code executes correctly and outputs the number of repeated iterations (which is 1).
Note: I am using PGI 16.5 with Geforce GTX 970 and CUDA 7.5.
I compile with the PGI compiler as follows:
pgcc -mp -acc -ta=multicore -g f1.c
So, my question is: why do I see such behavior? Can't I change the loop index variable in OpenACC?
Your OpenMP version is in error. You're relying on a static schedule where the chunk size is larger than "count". If you increase the number of OMP threads so the chunk size is smaller than count, or if you change the schedule to interleave the chunks (i.e. "schedule(static,1)"), then you'll get wrong answers. There are also race conditions on "x" and "count".
Note that OpenACC scheduling is more like OpenMP "static,1" so that vectors can access contiguous blocks of memory across a worker (aka a CUDA warp). So your algorithm won't work here either.
Also, by using the "independent" clause (which is implied when using "parallel loop"), you are asserting to the compiler that this loop does not contain dependencies, or that the user will handle them via the "atomic" directive. However, changing the loop index variable inside the body of the loop creates a loop dependency, since the value of the loop index depends on whether the previous iteration changed its value.
Edit: Below is a parallelizable version of your code.
% cat test2.c
#include <stdio.h>
#include <omp.h>
#include <unistd.h>
int main() {
int x = 52;
int count = 5;
int i;
int mycnt;
#pragma omp parallel for schedule(static,1) private(mycnt)
for(i=0;i<100;i++) {
if(i == x) {
mycnt = count;
while(mycnt > 0) {
printf("%d\n", i);
mycnt--;
}
}
}
#ifdef _OPENACC
int gpu_count = 0;
#pragma acc parallel loop reduction(+:gpu_count)
for(i=0;i<1000000;i++) {
if(i == x) {
mycnt = count;
while(mycnt > 0) {
gpu_count++;
mycnt--;
}
}
}
printf("gpu_count: %d\n", gpu_count);
#endif
return 0;
}
% pgcc -fast -mp -acc test2.c -Minfo=mp,acc
main:
13, Parallel region activated
Parallel loop activated with static cyclic schedule
24, Barrier
Parallel region terminated
25, Accelerator kernel generated
Generating Tesla code
25, Generating reduction(+:gpu_count)
26, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
29, #pragma acc loop seq
29, Loop carried scalar dependence for gpu_count at line 30
% a.out
52
52
52
52
52
gpu_count: 5
I'd like to run something like the following:
for (int index = 0; index < num; index++)
I want to run the for loop with four threads, with the iterations executed in order: 0,1,2,3,4,5,6,7,8, etc.
That is, for the threads to be working on index = n,(n+1),(n+2),(n+3) (in any order, but always in this pattern), I want the iterations index = 0,1,2,...,(n-1) to already be finished.
Is there a way to do this? ordered doesn't really work here, as making the body an ordered section would basically remove all parallelism for me, and scheduling doesn't seem to work because I don't want a thread to be working on a contiguous chunk of iterations k through k+num/4.
Thanks for any help!
You can do this not with a parallel for loop, but with a parallel region that manages its own loop inside, plus a barrier to make sure all running threads have hit the same point in it before being able to continue. Example:
#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>
int main()
{
atomic_int chunk = 0;
int num = 12;
int nthreads = 4;
omp_set_num_threads(nthreads);
#pragma omp parallel shared(chunk, num, nthreads)
{
for (int index; (index = atomic_fetch_add(&chunk, 1)) < num; ) {
printf("In index %d\n", index);
fflush(stdout);
#pragma omp barrier
// For illustrative purposes only; not needed in real code
#pragma omp single
{
puts("After barrier");
fflush(stdout);
}
}
}
puts("Done");
return 0;
}
One possible output:
$ gcc -std=c11 -O -fopenmp -Wall -Wextra demo.c
$ ./a.out
In index 2
In index 3
In index 1
In index 0
After barrier
In index 4
In index 6
In index 5
In index 7
After barrier
In index 10
In index 9
In index 8
In index 11
After barrier
Done
I'm not sure I understand your request correctly. If I try to summarize how I interpret it, it would be something like: "I want 4 threads sharing the iterations of a loop, with the 4 threads always running on at most 4 consecutive iterations of the loop".
If that's what you want, what about something like this:
int nths = 4;
#pragma omp parallel num_threads( nths )
for( int index_outer = 0; index_outer < num; index_outer += nths ) {
int end = min( index_outer + nths, num ); // min() is not standard C; assume e.g. #define min(a,b) ((a)<(b)?(a):(b))
#pragma omp for
for( int index = index_outer; index < end; index++ ) {
// the loop body just as before
} // there's a thread synchronization here
}
I'm learning OpenMP these days and I just came across the "threadprivate" directive. The code snippet below, which I wrote myself, doesn't produce the expected result:
// **** File: fun.h **** //
void seed(int x);
int drand();
// ********************* //
// **** File: fun.c **** //
extern int num;
int drand()
{
num = num + 1;
return num;
}
void seed(int num_initial)
{
num = num_initial;
}
// ************************ //
// **** File: main.c **** //
#include "fun.h"
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int num = 0;
#pragma omp threadprivate(num)
int main()
{
int num_inital = 4;
seed(num_inital);
printf("At the beginning, num = %d\n", num); // should num be 4?
#pragma omp parallel for num_threads(2) schedule(static,1) copyin(num)
for (int ii = 0; ii < 4; ii++) {
int my_rank = omp_get_thread_num();
//printf("Before processing, in thread %d num = %d\n", my_rank,num);
int num_in_loop = drand();
printf("Thread %d is processing loop %d: num = %d\n", my_rank,ii, num_in_loop);
}
system("pause");
return 0;
}
// ********************* //
Here are my questions:
Why does printf("At the beginning, num = %d\n", num); print num = 0 instead of num = 4?
As for the parallel for loop, multiple executions produce different results, one of which is:
Thread 1 is processing loop 1: num = 5
Thread 0 is processing loop 0: num = 6
Thread 1 is processing loop 3: num = 7
Thread 0 is processing loop 2: num = 8
It seems that num is initialized to 4 in the for loop, which means that the num in the copyin clause is equal to 4. Why is the num in printf("At the beginning, num = %d\n", num) different from the one seen by copyin?
On the OpenMP website, it says:
In parallel regions, references by the master thread will be to the copy of the variable in the thread that encountered the parallel region.
According to this explanation, Thread 0 (the master thread) should initially contain num = 4. Therefore, loop 0's output should always be: Thread 0 is processing loop 0: num = 5. Why is the result above different?
My working environment is the Windows 10 operating system with VS2015.
I think the problem is within the fun.c compilation unit. The compiler cannot determine that the extern int num; variable is also a thread-local (TLS) one.
I would include the directive #pragma omp threadprivate(num) in this file as well:
// **** File: fun.c **** //
extern int num;
#pragma omp threadprivate(num)
int drand()
{
num = num + 1;
return num;
}
void seed(int num_initial)
{
num = num_initial;
}
// ************************ //
In any case, the compiler should warn about it at the linking phase.
The copyin clause is meant to be used with OpenMP teams (e.g. computation on accelerators).
Indeed, the OpenMP documentation says:
These clauses support the copying of data values from private or threadprivate variables on one implicit task or thread to the corresponding variables on other implicit tasks or threads in the team.
Thus, in your case, you should rather use the firstprivate clause; a small sketch follows below.
Please note that the version (5.0) of the OpenMP documentation you are reading is probably not supported by VS2015. I advise you to read an older version compatible with VS2015; otherwise, the results of the compiled program are likely to be undefined.
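As a minimal standalone sketch of how firstprivate behaves (this only illustrates the clause with a plain loop; it does not reproduce the threadprivate/drand setup from the question):
#include <stdio.h>
#include <omp.h>
int main(void)
{
    int num = 4;                               // set before the parallel region
    #pragma omp parallel for num_threads(2) schedule(static,1) firstprivate(num)
    for (int ii = 0; ii < 4; ii++) {
        num = num + 1;                         // updates this thread's private copy only
        printf("Thread %d is processing loop %d: num = %d\n", omp_get_thread_num(), ii, num);
    }
    printf("After the loop, num = %d\n", num); // still 4: firstprivate copies are not written back
    return 0;
}
With 2 threads and schedule(static,1), each thread handles two iterations, so each thread prints 5 and then 6 for its own private copy.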
I am trying to distribute the work of multiplying two NxN matrices across 3 NVIDIA GPUs using 3 OpenMP threads. (The matrix values will get large, hence the long long data type.) However, I am having trouble placing the #pragma acc parallel loop in the correct place. I have used some examples from the NVIDIA PDFs that were shared, but with no luck. I know that the innermost loop cannot be parallelized, but I would like each of the three threads to own a GPU and do a portion of the work. Note that the input and output matrices are defined as global variables because I kept running out of stack memory.
I have tried the code below, but I get compilation errors, all pointing to line 75, which is the #pragma acc parallel loop line:
[test#server ~]pgcc -acc -mp -ta=tesla:cc60 -Minfo=all -o testGPU matrixMultiplyopenmp.c
PGC-S-0035-Syntax error: Recovery attempted by replacing keyword for by keyword barrier (matrixMultiplyopenmp.c: 75)
PGC-S-0035-Syntax error: Recovery attempted by replacing acc by keyword enum (matrixMultiplyopenmp.c: 76)
PGC-S-0036-Syntax error: Recovery attempted by inserting ';' before keyword for (matrixMultiplyopenmp.c: 77)
PGC/x86-64 Linux 18.10-1: compilation completed with severe errors
Function is:
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
// Get Nvidia device type
acc_init(acc_device_nvidia);
// Get Number of GPUs in system
int num_gpus = acc_get_num_devices(acc_device_nvidia);
//Set the number of OpenMP thread to the number of GPUs
#pragma omp parallel num_threads(num_gpus)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
acc_set_device_num(threadNum, acc_device_nvidia);
int row;
int col;
int key;
#pragma omp for
#pragma acc parallel loop
for (row = 0; row < SIZE; row++)
for (col = 0; col < SIZE; col++)
for (key = 0; key < SIZE; key++)
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}
}
As fisehara points out, you can't combine an OpenMP "for" loop with an OpenACC parallel loop on the same for loop. Instead, you need to manually decompose the work across the OpenMP threads. Example below.
Is there a reason why you want to use multiple GPUs here? Most likely the matrix multiply will fit onto a single GPU, so there's no need for the extra overhead of introducing host-side parallelization.
Also, I generally recommend using MPI+OpenACC for multi-GPU programming. Domain decomposition is naturally part of MPI but not inherent in OpenMP. Also, MPI gives you a one-to-one relationship between the host process and the accelerator, allows for scaling beyond a single node, and you can take advantage of CUDA-aware MPI for direct GPU-to-GPU data transfers. For more info, do a web search for "MPI OpenACC" and you'll find several tutorials. Class #2 at https://developer.nvidia.com/openacc-advanced-course is a good resource. (A minimal sketch of the MPI device-assignment pattern follows the example output below.)
% cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#define SIZE 130
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
#ifdef _OPENACC
// Get Nvidia device type
acc_init(acc_device_nvidia);
// Get Number of GPUs in system
int num_gpus = acc_get_num_devices(acc_device_nvidia);
#else
int num_gpus = omp_get_max_threads();
#endif
if (SIZE<num_gpus) {
num_gpus=SIZE;
}
printf("Num Threads: %d\n",num_gpus);
//Set the number of OpenMP thread to the number of GPUs
#pragma omp parallel num_threads(num_gpus)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
#ifdef _OPENACC
acc_set_device_num(threadNum, acc_device_nvidia);
printf("THID %d using GPU: %d\n",threadNum,threadNum);
#endif
int row;
int col;
int key;
int start, end;
int block_size;
block_size = SIZE/num_gpus;
start = threadNum*block_size;
end = start+block_size;
if (threadNum==(num_gpus-1)) {
// add the residual to the last thread
end = SIZE;
}
printf("THID: %d, Start: %d End: %d\n",threadNum,start,end-1);
#pragma acc parallel loop \
copy(matrixProduct[start:end-start][:SIZE]), \
copyin(matrixA[start:end-start][:SIZE],matrixB[:SIZE][:SIZE])
for (row = start; row < end; row++) {
#pragma acc loop vector
for (col = 0; col < SIZE; col++) {
for (key = 0; key < SIZE; key++) {
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}}}
}
}
int main() {
long long int matrixA[SIZE][SIZE];
long long int matrixB[SIZE][SIZE];
long long int matrixProduct[SIZE][SIZE];
int i,j;
for(i=0;i<SIZE;++i) {
for(j=0;j<SIZE;++j) {
matrixA[i][j] = (i*SIZE)+j;
matrixB[i][j] = (j*SIZE)+i;
matrixProduct[i][j]=0;
}
}
multiplyMatrix(matrixA,matrixB,matrixProduct);
printf("Result:\n");
for(i=0;i<SIZE;++i) {
printf("%d: %ld %ld\n",i,matrixProduct[i][0],matrixProduct[i][SIZE-1]);
}
}
% pgcc test.c -mp -ta=tesla -Minfo=accel,mp
multiplyMatrix:
28, Parallel region activated
49, Generating copyin(matrixB[:130][:])
Generating copy(matrixProduct[start:end-start][:131])
Generating copyin(matrixA[start:end-start][:131])
Generating Tesla code
52, #pragma acc loop gang /* blockIdx.x */
54, #pragma acc loop vector(128) /* threadIdx.x */
55, #pragma acc loop seq
54, Loop is parallelizable
55, Complex loop carried dependence of matrixA->,matrixProduct->,matrixB-> prevents parallelization
Loop carried dependence of matrixProduct-> prevents parallelization
Loop carried backward dependence of matrixProduct-> prevents vectorization
59, Parallel region terminated
% a.out
Num Threads: 4
THID 0 using GPU: 0
THID: 0, Start: 0 End: 31
THID 1 using GPU: 1
THID: 1, Start: 32 End: 63
THID 3 using GPU: 3
THID: 3, Start: 96 End: 129
THID 2 using GPU: 2
THID: 2, Start: 64 End: 95
Result:
0: 723905 141340355
1: 1813955 425843405
2: 2904005 710346455
3: 3994055 994849505
...
126: 138070205 35988724655
127: 139160255 36273227705
128: 140250305 36557730755
129: 141340355 36842233805
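For reference, the MPI+OpenACC pairing mentioned above usually boils down to binding one rank to one GPU; a rough, untested sketch (my own, not part of the answer above):
#include <mpi.h>
#include <openacc.h>
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Bind this rank to one GPU. In a multi-node job you would use the
    // node-local rank here instead of the global rank.
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);
    // ... each rank then works on its own block of rows with a plain
    // "#pragma acc parallel loop"; no OpenMP is needed ...
    MPI_Finalize();
    return 0;
}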
I ran into an issue with MPI+OpenACC compilation on the shared system I was restricted to, and I could not upgrade the compiler. The solution I ended up using was breaking the work down with OpenMP first and then calling an OpenACC function, as follows:
//Main code
#pragma omp parallel num_threads(num_gpus)
{
#pragma omp for private(tid)
for (tid = 0; tid < num_gpus; tid++)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
acc_set_device_num(threadNum, acc_device_nvidia);
// check with thread is using which GPU
int gpu_num = acc_get_device_num(acc_device_nvidia);
printf("Thread # %d is going to use GPU # %d \n", threadNum, gpu_num);
//distribute the uneven rows
if (threadNum < extraRows)
{
startRow = threadNum * (rowsPerThread + 1);
stopRow = startRow + rowsPerThread;
}
else
{
startRow = threadNum * rowsPerThread + extraRows;
stopRow = startRow + (rowsPerThread - 1);
}
// Debug to check allocation of data to threads
//printf("Start row is %d, and Stop rows is %d \n", startRow, stopRow);
GPUmultiplyMatrix(matrixA, matrixB, matrixProduct, startRow, stopRow);
}
}
void GPUmultiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE],
long long int matrixProduct[SIZE][SIZE], int startRow, int stopRow)
{
int row;
int col;
int key;
#pragma acc parallel loop collapse (2)
for (row = startRow; row <= stopRow; row++)
for (col = 0; col < SIZE; col++)
for (key = 0; key < SIZE; key++)
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}
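One note on this version (my observation, not the original poster's): the parallel loop has no data clauses, so it relies on the compiler's implicit data movement and may copy the full matrices on every call. The row-range clauses from the answer above should carry over here as well, roughly (note stopRow is inclusive in this routine, hence the +1):
#pragma acc parallel loop collapse (2) \
 copy(matrixProduct[startRow:stopRow-startRow+1][0:SIZE]), \
 copyin(matrixA[startRow:stopRow-startRow+1][0:SIZE],matrixB[0:SIZE][0:SIZE])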
I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use thread-private variables to get around that.
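For instance, a minimal sketch of that idea (mine, not the answerer's, reusing MAX from the question): each thread accumulates into its own private variable and only touches shared state once, at the very end.
unsigned int total = 0;
#pragma omp parallel
{
    unsigned int my_total = 0;              // private to this thread, so no race
    int i;
    #pragma omp for
    for (i = 1; i < MAX; i++)
        if (i % 3 == 0 || i % 5 == 0)
            my_total += i;
    #pragma omp atomic
    total += my_total;                      // one synchronized update per thread
}
printf("Total : %u\n", total);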
In your code, j is a shared induction variable. You can't use shared induction variables efficiently with multiple threads (using atomic every iteration is not efficient).
You could find a special solution that does not use induction variables (for example, using wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables, like this:
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could emulate in C using realloc, but I'm not going to do that for you.
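(For completeness, here is my own rough sketch of such a realloc-backed growable buffer; each thread would keep one of these instead of a fixed-size NUMS_local, and error handling is omitted:)
#include <stdlib.h>
typedef struct { unsigned int *data; size_t len, cap; } uvec;   // hypothetical helper type
static void uvec_push(uvec *v, unsigned int x)
{
    if (v->len == v->cap) {                         // grow geometrically when full
        v->cap = v->cap ? 2 * v->cap : 16;
        v->data = realloc(v->data, v->cap * sizeof *v->data);
    }
    v->data[v->len++] = x;                          // append
}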
Edit:
Here is a special solution which does not use shared induction variables, using wheel factorization:
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 15*i + wheel[k]; // the wheel repeats every 15 numbers
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without the critical section (the one implied by the ordered clause). This does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads; i++) prefix[i+1] += prefix[i]; // stop at nthreads-1 to stay in bounds; prefix[nthreads] ends up holding the total count
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread-synchronization issue. All you need to do is use a synchronization object to make the desired operation atomic (incrementing the variable j in your case). It could be a mutex, a semaphore, or an event object, depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on the Windows operating system, there are synchronization mechanisms provided by the environment (e.g. InterlockedIncrement, CreateCriticalSection, etc.). If you're working on a Unix/Linux-based operating system, you can use mutex or semaphore synchronization objects. All of these mechanisms stem from the concept of semaphores, invented by Edsger W. Dijkstra at the beginning of the 1960s.
Here's some basic examples below:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that threads don't necessarily execute in order, so the last thread to write may not have read the value in order, and you end up overwriting data incorrectly.
OpenMP has an option to make the threads in a loop perform a summation (a reduction) when they finish. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin,NULL);
All you have to do is accumulate the result in "sum"; this is from some old code of mine that performs a summation.
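Adapted to this question (a sketch of mine: it sums the multiples directly rather than storing them first), that would look roughly like:
int total = 0;
#pragma omp parallel for reduction(+:total)
for (int i = 1; i < MAX; i++) {
    if (i % 3 == 0 || i % 5 == 0)
        total += i;   // each thread adds into its private copy; OpenMP combines them at the end
}
printf("Total : %d\n", total);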
The other option you have is the dirty one: somehow make the threads wait and take their turns using a call to the OS. This is easier than it looks. This would be a solution:
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend that you read through the OpenMP options fully.
I have an implementation of the parallel bubble sort algorithm (odd-even transposition sort) in C, using OpenMP. However, after testing it, it's slower than the serial version (by about 10%), although I have a 4-core processor (2 physical x 2 because of Intel hyperthreading). I have checked that the cores are actually used, and I can see each of them at 100% when running the program. Therefore, I think I made a mistake in implementing the algorithm.
I am using linux with kernel 2.6.38-8-generic.
This is how I compile:
gcc -o bubble-sort bubble-sort.c -Wall -fopenmp
or, for the serial version (without OpenMP):
gcc -o bubble-sort bubble-sort.c -Wall
This is how I run it:
./bubble-sort < in_10000 > out_10000
#include <omp.h>
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main()
{
int i, n, tmp, *x, changes;
int chunk;
scanf("%d ", &n);
chunk = n / 4;
x = (int*) malloc(n * sizeof(int));
for(i = 0; i < n; ++i)
scanf("%d ", &x[i]);
changes = 1;
int nr = 0;
while(changes)
{
#pragma omp parallel private(tmp)
{
nr++;
changes = 0;
#pragma omp for \
reduction(+:changes)
for(i = 0; i < n - 1; i = i + 2)
{
if(x[i] > x[i+1] )
{
tmp = x[i];
x[i] = x[i+1];
x[i+1] = tmp;
++changes;
}
}
#pragma omp for \
reduction(+:changes)
for(i = 1; i < n - 1; i = i + 2)
{
if( x[i] > x[i+1] )
{
tmp = x[i];
x[i] = x[i+1];
x[i+1] = tmp;
++changes;
}
}
}
}
return 0;
}
Later edit:
It seems to work well now after I made the changes you suggested. It also scales pretty well (I tested on 8 physical cores too: it took 21 s for a set of 150k numbers, which is far less than on one core). However, if I set the OMP_SCHEDULE environment variable myself, the performance decreases...
You should profile it and check where threads spend time.
One possible reason is that parallel regions are constantly created and destroyed; depending on the OpenMP implementation, this could lead to re-creation of the thread pool, though good implementations should probably handle this case.
Some small things to shave off:
- ok seems completely unnecessary, you can just change the loop exit condition to i<n-1;
- the explicit barrier is unnecessary: first, you put it outside the parallel regions, so it makes no sense there; and second, OpenMP parallel regions and loops already have implicit barriers at the end;
- combine at least the two consecutive parallel regions inside the while loop:
#pragma omp parallel private(tmp)
{
#pragma omp for bla-bla
for (i=0; i<n-1; i+=2 ) {...}
#pragma omp for bla-bla
for (i=1; i<n-1; i+=2 ) {...}
}
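Going one step further on the first point (the parallel region being re-created on every pass of the while loop), the region can be hoisted outside the while loop entirely. A rough sketch of that idea (mine, untested; changes is still initialized to 1 before the region, as in the original code):
#pragma omp parallel private(tmp)
{
    while (1) {
        // make sure every thread has read 'changes' from the previous pass
        // before one thread resets it
        #pragma omp barrier
        #pragma omp single
        changes = 0;                        // implicit barrier at the end of single
        #pragma omp for reduction(+:changes)
        for (i = 0; i < n - 1; i += 2) {
            if (x[i] > x[i+1]) { tmp = x[i]; x[i] = x[i+1]; x[i+1] = tmp; ++changes; }
        }
        #pragma omp for reduction(+:changes)
        for (i = 1; i < n - 1; i += 2) {
            if (x[i] > x[i+1]) { tmp = x[i]; x[i] = x[i+1]; x[i+1] = tmp; ++changes; }
        }
        // implicit barrier here, so all threads see the combined 'changes'
        if (!changes) break;                // every thread takes the same branch
    }
}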