I have been trying to write a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>

#define NUM_THREADS 10
#define MAX 1000

//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
    omp_set_num_threads(10);//set number of threads to be used in the parallel loop
    unsigned int NUMS[1000] = { 0 };
    int j = 0;

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();//get thread ID
        int i;
        for(i = ID + 1; i < MAX; i += NUM_THREADS)
        {
            if( i % 5 == 0 || i % 3 == 0)
            {
                NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
            }
        }
    }

    int i = 0;
    unsigned int total;
    for(i = 0; NUMS[i] != 0; i++) total += NUMS[i];//add up multiples of 3 and 5
    printf("Total : %d\n", total);
    return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use thread-local (private) variables to get around that.
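For the original problem, where only the sum is actually needed, a minimal sketch using an OpenMP reduction sidesteps the shared counter entirely (this is a simplification that drops the NUMS array and keeps only the total):
#include <stdio.h>
#include <omp.h>
#define MAX 1000
int main(void)
{
    unsigned int total = 0;
    int i;
    //each thread accumulates a private copy of total; OpenMP adds the copies together at the end
    #pragma omp parallel for reduction(+:total)
    for (i = 1; i < MAX; i++) {
        if (i % 3 == 0 || i % 5 == 0)
            total += i;
    }
    printf("Total : %u\n", total);
    return 0;
}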
In your code, j is a shared induction variable. You can't use shared induction variables efficiently with multiple threads (making every iteration atomic is not efficient).
You could find a special solution that avoids induction variables (for example, wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables, like this:
//assumes i, j, total and NUMS[MAX] are declared before the parallel region, with total = 0 and j = 0
#pragma omp parallel
{
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};

    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;   //each thread fills its own private array
            total += i;
        }
    }

    #pragma omp for schedule(static) ordered
    for(i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered        //threads append their chunks to NUMS in thread order
        {
            memcpy(&NUMS[j], NUMS_local, sizeof *NUMS * k);
            j += k;
        }
    }
}
This solution does not make optimal use of memory however. A better solution would use something like std::vector from C++ which you could implement for example using realloc in C but I'm not going to do that for you.
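For reference, a minimal sketch of such a growable buffer in C (the names are hypothetical and error handling is omitted); each thread would push into its own instance instead of a fixed-size NUMS_local array:
#include <stdlib.h>

//minimal growable buffer of unsigned ints, roughly what std::vector does
typedef struct {
    unsigned int *data;
    size_t size, capacity;
} vec_u32;

static void vec_push(vec_u32 *v, unsigned int x)
{
    if (v->size == v->capacity) {
        v->capacity = v->capacity ? 2 * v->capacity : 16;           //grow geometrically
        v->data = realloc(v->data, v->capacity * sizeof *v->data);  //error check omitted for brevity
    }
    v->data[v->size++] = x;
}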
Edit:
Here is a special solution which avoids shared induction variables by using wheel factorization:
int wheel[] = {0,3,5,6,9,10,12}; //offsets of the multiples of 3 or 5 within each block of 15
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
    for(int k=0; k<7; k++) {
        NUMS[7*i + k] = 15*i + wheel[k];
        total += NUMS[7*i + k];
    }
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
    if(i%5==0 || i%3==0) {
        NUMS[j++] = i;
        total += i;
    }
}
Edit: It's possible to do this without the serialization imposed by the ordered clause. This version does the memcpy in parallel and also makes better use of memory, at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
    int i;
    int nthreads = omp_get_num_threads();
    int ithread  = omp_get_thread_num();

    #pragma omp single
    {
        prefix = malloc(sizeof *prefix * (nthreads+1));
        prefix[0] = 0;
    }

    int k = 0;
    unsigned int NUMS_local[MAX] = {0};

    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }

    prefix[ithread+1] = k;   //each thread records how many values it found
    #pragma omp barrier
    #pragma omp single
    {
        //exclusive prefix sum: prefix[t] is where thread t's chunk starts in NUMS
        for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
        NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
        j = prefix[nthreads];  //total number of values found
    }

    memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS * k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation atomic (incrementing the variable j in your case). That could be a mutex, a semaphore or an event object, depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow logic should look like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on the Windows operating system, there are some special synchronization mechanisms provided by the environment (e.g. InterlockedIncrement or CreateCriticalSection). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. Actually, all those synchronization mechanisms stem from the concept of semaphores, which was invented by Edsger W. Dijkstra in the early 1960s.
Here are some basic examples:
Linux
#include <stdlib.h>
#include <pthread.h>

int j = 0; //the shared counter from the question (assumed; declare it wherever it actually lives)
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;

int main(int argc, char* argv[])
{
    // ...
    pthread_mutex_lock(&g_mutexObject);
    ++j;                                 // incrementing j atomically
    pthread_mutex_unlock(&g_mutexObject);
    // ...
    pthread_mutex_destroy(&g_mutexObject);
    // ...
    exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
#include <stdlib.h>

int j = 0; //the shared counter from the question (assumed)
CRITICAL_SECTION g_csObject;

int main(void)
{
    // ...
    InitializeCriticalSection(&g_csObject);
    // ...
    EnterCriticalSection(&g_csObject);
    ++j;                                 // incrementing j atomically
    LeaveCriticalSection(&g_csObject);
    // ...
    DeleteCriticalSection(&g_csObject);
    // ...
    exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
#include <stdlib.h>

LONG volatile g_j; // our little j must be volatile in here now

int main(void)
{
    // ...
    InterlockedIncrement(&g_j);          // incrementing j atomically
    // ...
    exit(EXIT_SUCCESS);
}
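Since the question already uses OpenMP, the same protection can also be expressed with an OpenMP directive instead of an OS-level object. A sketch of the loop from the question (note that the critical section serializes the threads at this point; a bare #pragma omp atomic would cover j++ alone, but not the store that uses the old value of j):
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    int i;
    for (i = ID + 1; i < MAX; i += NUM_THREADS) {
        if (i % 5 == 0 || i % 3 == 0) {
            #pragma omp critical
            {
                NUMS[j++] = i; //only one thread at a time reads j, stores, and increments
            }
        }
    }
}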
The problem you have is that threads don't necessarily execute in order, so the last thread to write may not have read the value in order, and you end up overwriting the wrong data.
There is a way to have the threads in a loop perform a summation when they finish, using the OpenMP reduction clause. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
    sum = sum + A[k]*B[k];
}
/* End of the computation */
gettimeofday(&fin,NULL);
All you have to do is collect the result in "sum"; this is from some old code of mine that does a summation.
The other option you have is the dirty one: somehow make the threads wait and take turns by using a call to the OS. This is easier than it looks. The following would be one such solution.
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
But I recommend you read through the OpenMP options fully.
Related
By using OpenMP I'm trying to parallelize the creation of a kind of dictionary, defined as follows.
typedef struct Symbol {
    int usage;
    char character;
} Symbol;

typedef struct SymbolDictionary {
    int charsNr;
    Symbol *symbols;
} SymbolDictionary;
I wrote the following code.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <omp.h>

static const int n = 10;

int main(int argc, char* argv[]) {
    int thread_count = strtol(argv[1], NULL, 10);
    omp_set_dynamic(0);
    omp_set_num_threads(thread_count);
    SymbolDictionary **symbolsDict = calloc(omp_get_max_threads(), sizeof(SymbolDictionary*));
    SymbolDictionary *dict = NULL;
    int count = 0;

    #pragma omp parallel for firstprivate(dict, count) shared(symbolsDict)
    for (int i = 0; i < n; i++) {
        if (count == 0) {
            dict = calloc(1, sizeof(SymbolDictionary));
            dict->charsNr = 0;
            dict->symbols = calloc(n, sizeof(Symbol));
            #pragma omp critical
            symbolsDict[omp_get_thread_num()] = dict;
        }
        dict->symbols[count].usage = i;
        dict->symbols[count].character = 'a' + i;
        ++dict->charsNr;
        ++count;
    }

    if (omp_get_max_threads() > 1) {
        // merge the dictionaries
    }

    for (int j = 0; j < symbolsDict[0]->charsNr; j++)
        printf("symbolsDict[0][%d].character: %c\nsymbolsDict[0][%d].usage: %d\n",
               j,
               symbolsDict[0]->symbols[j].character,
               j,
               symbolsDict[0]->symbols[j].usage);

    for (int i = 0; i < omp_get_max_threads(); i++)
        free(symbolsDict[i]->symbols);
    free(symbolsDict);

    return 0;
}
The code compiles and runs, but I'm not sure how the omp block works or whether I implemented it correctly. In particular, I have to attach the dict to symbolsDict at the beginning of the loop, because I don't know when a thread will complete its work. By doing that, however, different threads will probably write into symbolsDict at the same time, although in different memory locations. Even though the threads use different slots and dict should be different for every thread, I'm not sure this is a good way to do it.
I tested the code with different threads and creating dictionaries of different sizes.
I didn't have any kind of problem, but maybe it was just chance.
Basically I looked around the documentation for the theory. So I would like to know: did I implement the code correctly? If not, what is incorrect and why?
different threads will probably write into symbolsDict at the same time, although in different memory locations. Even though the threads use different slots and dict should be different for every thread, I'm not sure this is a good way to do it.
It isn't a good way but it is safe. A cleaner way would be this:
SymbolDictionary **symbolsDict = calloc(
    omp_get_max_threads(), sizeof(SymbolDictionary*));

#pragma omp parallel
{
    SymbolDictionary *dict = calloc(1, sizeof(SymbolDictionary));
    int count = 0;
    dict->charsNr = 0;
    dict->symbols = calloc(n, sizeof(Symbol));
    symbolsDict[omp_get_thread_num()] = dict;

    #pragma omp for nowait
    for(int i = 0; i < n; i++) {
        dict->symbols[count].usage = i;
        dict->symbols[count].character = 'a' + i;
        ++dict->charsNr;
        ++count;
    }
}
Note that the inner pragma is omp for, not omp parallel for so it is using the outer parallel block to distribute its work. The nowait is a performance improvement that avoids a thread barrier at the end of the loop since it is the last part of the parallel section and threads wait for all other threads at the end of the section anyway.
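The question leaves the merge step open. A hedged sketch of one possible merge, assuming the intended semantics is to sum usage counts for identical characters into the first dictionary (adjust this to your real merge rule; it also assumes symbolsDict[0]->symbols has room for up to n entries in total):
SymbolDictionary *merged = symbolsDict[0];
for (int t = 1; t < omp_get_max_threads(); t++) {
    SymbolDictionary *src = symbolsDict[t];
    if (src == NULL) continue;                      //a thread may not have created a dictionary
    for (int s = 0; s < src->charsNr; s++) {
        int found = 0;
        for (int m = 0; m < merged->charsNr; m++) {
            if (merged->symbols[m].character == src->symbols[s].character) {
                merged->symbols[m].usage += src->symbols[s].usage;  //same character: sum the usages
                found = 1;
                break;
            }
        }
        if (!found)
            merged->symbols[merged->charsNr++] = src->symbols[s];   //new character: append it
    }
}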
I am trying to distribute the work of multiplying two NxN matrices across 3 nVidia GPUs using 3 OpenMP threads. (The matrix values will get large, hence the long long data type.) However, I am having trouble placing the #pragma acc parallel loop in the correct place. I have used some examples from the nVidia PDFs, but with no luck. I know that the innermost loop cannot be parallelized. But I would like each of the three threads to own a GPU and do a portion of the work. Note that the input and output matrices are defined as global variables because I kept running out of stack memory.
I have tried the code below, but I get compilation errors, all pointing to line 75, which is the #pragma acc parallel loop line:
[test#server ~]pgcc -acc -mp -ta=tesla:cc60 -Minfo=all -o testGPU matrixMultiplyopenmp.c
PGC-S-0035-Syntax error: Recovery attempted by replacing keyword for by keyword barrier (matrixMultiplyopenmp.c: 75)
PGC-S-0035-Syntax error: Recovery attempted by replacing acc by keyword enum (matrixMultiplyopenmp.c: 76)
PGC-S-0036-Syntax error: Recovery attempted by inserting ';' before keyword for (matrixMultiplyopenmp.c: 77)
PGC/x86-64 Linux 18.10-1: compilation completed with severe errors
Function is:
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
    // Get Nvidia device type
    acc_init(acc_device_nvidia);
    // Get Number of GPUs in system
    int num_gpus = acc_get_num_devices(acc_device_nvidia);

    //Set the number of OpenMP thread to the number of GPUs
    #pragma omp parallel num_threads(num_gpus)
    {
        //Get thread openMP number and set the GPU device to that number
        int threadNum = omp_get_thread_num();
        acc_set_device_num(threadNum, acc_device_nvidia);

        int row;
        int col;
        int key;

        #pragma omp for
        #pragma acc parallel loop
        for (row = 0; row < SIZE; row++)
            for (col = 0; col < SIZE; col++)
                for (key = 0; key < SIZE; key++)
                    matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
    }
}
As fisehara points out, you can't combine an OpenMP worksharing "for" loop with an OpenACC parallel loop on the same for loop. Instead, you need to manually decompose the work across the OpenMP threads. Example below.
Is there a reason why you want to use multiple GPUs here? Most likely the matrix multiply will fit on to a single GPU so there's no need for the extra overhead of introducing host-side parallelization.
Also, I generally recommend using MPI+OpenACC for multi-gpu programming. Domain decomposition is naturally part of MPI but not inherent in OpenMP. Also, MPI gives you a one-to-one relationship between the host process and accelerator, allows for scaling beyond a single node, and you can take advantage of CUDA Aware MPI for direct GPU to GPU data transfers. For more info, do a web search for "MPI OpenACC" and you'll find several tutorials. Class #2 at https://developer.nvidia.com/openacc-advanced-course is a good resource.
% cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#define SIZE 130
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
#ifdef _OPENACC
    // Get Nvidia device type
    acc_init(acc_device_nvidia);
    // Get Number of GPUs in system
    int num_gpus = acc_get_num_devices(acc_device_nvidia);
#else
    int num_gpus = omp_get_max_threads();
#endif
    if (SIZE < num_gpus) {
        num_gpus = SIZE;
    }
    printf("Num Threads: %d\n", num_gpus);

    //Set the number of OpenMP thread to the number of GPUs
    #pragma omp parallel num_threads(num_gpus)
    {
        //Get thread openMP number and set the GPU device to that number
        int threadNum = omp_get_thread_num();
#ifdef _OPENACC
        acc_set_device_num(threadNum, acc_device_nvidia);
        printf("THID %d using GPU: %d\n", threadNum, threadNum);
#endif
        int row;
        int col;
        int key;
        int start, end;
        int block_size;

        block_size = SIZE / num_gpus;
        start = threadNum * block_size;
        end = start + block_size;
        if (threadNum == (num_gpus - 1)) {
            // add the residual to the last thread
            end = SIZE;
        }
        printf("THID: %d, Start: %d End: %d\n", threadNum, start, end - 1);

        #pragma acc parallel loop \
            copy(matrixProduct[start:end-start][:SIZE]), \
            copyin(matrixA[start:end-start][:SIZE],matrixB[:SIZE][:SIZE])
        for (row = start; row < end; row++) {
            #pragma acc loop vector
            for (col = 0; col < SIZE; col++) {
                for (key = 0; key < SIZE; key++) {
                    matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
                }
            }
        }
    }
}

int main() {
    long long int matrixA[SIZE][SIZE];
    long long int matrixB[SIZE][SIZE];
    long long int matrixProduct[SIZE][SIZE];
    int i, j;
    for (i = 0; i < SIZE; ++i) {
        for (j = 0; j < SIZE; ++j) {
            matrixA[i][j] = (i * SIZE) + j;
            matrixB[i][j] = (j * SIZE) + i;
            matrixProduct[i][j] = 0;
        }
    }
    multiplyMatrix(matrixA, matrixB, matrixProduct);
    printf("Result:\n");
    for (i = 0; i < SIZE; ++i) {
        printf("%d: %ld %ld\n", i, matrixProduct[i][0], matrixProduct[i][SIZE - 1]);
    }
}
% pgcc test.c -mp -ta=tesla -Minfo=accel,mp
multiplyMatrix:
28, Parallel region activated
49, Generating copyin(matrixB[:130][:])
Generating copy(matrixProduct[start:end-start][:131])
Generating copyin(matrixA[start:end-start][:131])
Generating Tesla code
52, #pragma acc loop gang /* blockIdx.x */
54, #pragma acc loop vector(128) /* threadIdx.x */
55, #pragma acc loop seq
54, Loop is parallelizable
55, Complex loop carried dependence of matrixA->,matrixProduct->,matrixB-> prevents parallelization
Loop carried dependence of matrixProduct-> prevents parallelization
Loop carried backward dependence of matrixProduct-> prevents vectorization
59, Parallel region terminated
% a.out
Num Threads: 4
THID 0 using GPU: 0
THID: 0, Start: 0 End: 31
THID 1 using GPU: 1
THID: 1, Start: 32 End: 63
THID 3 using GPU: 3
THID: 3, Start: 96 End: 129
THID 2 using GPU: 2
THID: 2, Start: 64 End: 95
Result:
0: 723905 141340355
1: 1813955 425843405
2: 2904005 710346455
3: 3994055 994849505
...
126: 138070205 35988724655
127: 139160255 36273227705
128: 140250305 36557730755
129: 141340355 36842233805
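For completeness, a minimal sketch of the MPI+OpenACC device-binding approach recommended above (only the rank-to-GPU binding is shown; the row decomposition and the gathering of results are left out and would use the usual MPI calls):
#include <mpi.h>
#include <openacc.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Bind each rank to one GPU (round-robin if there are more ranks than GPUs on the node)
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);

    // ... each rank computes its block of rows with #pragma acc parallel loop,
    // then the blocks are combined, e.g. with MPI_Gather / MPI_Gatherv ...

    MPI_Finalize();
    return 0;
}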
I ran into an issue with MPI+OpenACC compilation on the shared system I was restricted to, and I could not upgrade the compiler. The solution I ended up using was breaking the work down with OpenMP first and then calling an OpenACC function, as follows:
//Main code
#pragma omp parallel num_threads(num_gpus)
{
    #pragma omp for private(tid)
    for (tid = 0; tid < num_gpus; tid++)
    {
        //Get thread openMP number and set the GPU device to that number
        int threadNum = omp_get_thread_num();
        acc_set_device_num(threadNum, acc_device_nvidia);

        // check which thread is using which GPU
        int gpu_num = acc_get_device_num(acc_device_nvidia);
        printf("Thread # %d is going to use GPU # %d \n", threadNum, gpu_num);

        //distribute the uneven rows (rowsPerThread and extraRows are assumed to be
        //computed from SIZE and num_gpus before the parallel region)
        if (threadNum < extraRows)
        {
            startRow = threadNum * (rowsPerThread + 1);
            stopRow = startRow + rowsPerThread;
        }
        else
        {
            startRow = threadNum * rowsPerThread + extraRows;
            stopRow = startRow + (rowsPerThread - 1);
        }

        // Debug to check allocation of data to threads
        //printf("Start row is %d, and Stop rows is %d \n", startRow, stopRow);

        GPUmultiplyMatrix(matrixA, matrixB, matrixProduct, startRow, stopRow);
    }
}
void GPUmultiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE],
                       long long int matrixProduct[SIZE][SIZE], int startRow, int stopRow)
{
    int row;
    int col;
    int key;

    #pragma acc parallel loop collapse (2)
    for (row = startRow; row <= stopRow; row++)
        for (col = 0; col < SIZE; col++)
            for (key = 0; key < SIZE; key++)
                matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}
I would like to write a multithreaded program where each thread outputs an array with an unknown number of elements.
For example, select all numbers that are < 10 from an int array and put them into a new array.
Pseudo code (8 threads):
int *hugeList = malloc(10000000);
for (long i = 0; i < 1000000; ++i)
{
    hugeList[i] = (rand() % 100);//random integers from 0 to 99
}

long *subList[8];//to fill each thread's result

#pragma omp parallel
for (long i = 0; i < 1000000; ++i)
{
    long n = 0;
    if(hugeList[i] < 10)
    {
        //do something to fill "subList" properly
        subList[threadNo][n] = hugeList[i];
        n++;
    }
}
Array "subList" should collect the elements in "hugeList" which satisfies condition (<10) ,sequentially and in terms of thread number.
How should I write the code? It is OK if there is a better way using OpenMP.
There are several problems in your code.
1/ The omp pragma should be parallel for if you want the for loop to be parallelized. Otherwise, the code will be duplicated in every thread.
2/ The code is inconsistent with its comment:
//do something to fill "subList" properly
hugeList[i] = subList[threadNo][n];
3/ How do you know the number of elements in your sublists? It must be returned to the main thread. You could use an array, but beware of false sharing. Better to use a local variable and write it out at the end of the parallel section.
4/ sublist is not allocated. The difficulty is that you do not know the number of threads. You can ask omp for the maximum number of threads (omp_get_max_threads()) and do a dynamic allocation. If you want some static allocation, maybe the best is to allocate a large table and to compute the actual address in every thread.
5/ omp code must also work without an OpenMP compiler. Use #ifdef _OPENMP for that.
Here is an (untested) way your code can be written
#define HUGE 10000000

int *hugeList = (int *) malloc(HUGE * sizeof(int)); // allocate HUGE ints, not HUGE bytes

#ifdef _OPENMP
int thread_nbr = omp_get_max_threads();
#else
int thread_nbr = 1; // to ensure proper behavior in a sequential context
#endif

struct thread_results {  // to hold per thread results
    int nbr;             // nbr of generated results
    int *results;        // actual filtered numbers. Will write in subList table
};

// could be parallelized, but rand is not thread safe. drand48 should be
for (long i = 0; i < 1000000; ++i)
{
    hugeList[i] = (rand() % 100); //random integers from 0 to 99
}

int *subList = (int *)malloc(HUGE * sizeof(int)); // table to hold thread results
// this is more complex to have a 2D array here as max_thread and actual number of thread
// are not known at compile time. VLA cannot be used (and array dim can be very large).
// Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
// dimensionned accordingly to avoid bugs.
struct thread_results *threadres = (struct thread_results *)malloc(thread_nbr * sizeof(struct thread_results));

#pragma omp parallel
{
    // first declare and initialize thread vars
#ifdef _OPENMP
    int thread_id = omp_get_thread_num();   // hold thread id
    int thread_nbr = omp_get_num_threads(); // hold actual nbr of threads
#else
    // to ensure proper serial behavior
    int thread_id = 0;
    int thread_nbr = 1;
#endif
    struct thread_results *res = threadres + thread_id;
    res->nbr = 0;
    // compute address in subList table
    res->results = subList + (HUGE / thread_nbr) * thread_id;
    int *res_ptr = res->results; // local pointer. Each thread points to independent part of subList table
    int n = 0; // number of results. We want one per thread to only have local updates.

    #pragma omp for
    for (long i = 0; i < 1000000; ++i)
    {
        if (hugeList[i] < 10)
        {
            //do something to fill "subList" properly
            res_ptr[n] = hugeList[i];
            n++;
        }
    }
    res->nbr = n;
}
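After the parallel region, the per-thread chunks can be compacted into a contiguous prefix of subList in thread order; a small sketch of that final step (it assumes the region actually ran with thread_nbr threads, as allocated above, and memmove needs <string.h>):
int total_nbr = 0;
for (int t = 0; t < thread_nbr; t++) {
    struct thread_results *res = threadres + t;
    // move thread t's chunk right behind the previous threads' results
    memmove(subList + total_nbr, res->results, res->nbr * sizeof(int));
    total_nbr += res->nbr;
}
// subList[0 .. total_nbr-1] now holds all selected elements, ordered by thread number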
Updated complete code based on @Alain Merigot's answer.
I tested the following code; it is reproducible (with and without the #pragma arguments).
However, only the front elements of subList are correct, while the rest are empty.
(filename.c)
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
#include <math.h>

#define HUGE 10000000
#define DELAY 1000 //depends on your CPU power

//use global variables to store desired results, otherwise they can't be obtained outside "pragma"
int n = 0;       // number of results. We want one per thread to only have local updates.
double *subList; // table to hold thread results

int main()
{
    double *hugeList = (double *)malloc(HUGE);
#ifdef _OPENMP
    int thread_nbr = omp_get_max_threads();
#else
    int thread_nbr = 1; // to ensure proper behavior in a sequential context
#endif
    struct thread_results
    {                    // to hold per thread results
        int nbr;         // nbr of generated results
        double *results; // actual filtered numbers. Will write in subList table
    };

    // could be parallelized, but rand is not thread safe. drand48 should be
    for (long i = 0; i < 1000000; ++i)
    {
        hugeList[i] = sin(i); //fixed array content to test reproducibility
    }

    subList = (double *)malloc(HUGE * sizeof(double)); // table to hold thread results
    // this is more complex to have a 2D array here as max_thread and actual number of thread
    // are not known at compile time. VLA cannot be used (and array dim can be very large).
    // Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
    // dimensionned accordingly to avoid bugs.
    struct thread_results *threadres = (struct thread_results *)malloc(thread_nbr * sizeof(struct thread_results));

    #pragma omp parallel
    {
        // first declare and initialize thread vars
#ifdef _OPENMP
        int thread_id = omp_get_thread_num();   // hold thread id
        int thread_nbr = omp_get_num_threads(); // hold actual nbr of threads
#else
        // to ensure proper serial behavior
        int thread_id = 0;
        int thread_nbr = 1;
#endif
        struct thread_results *res = threadres + thread_id;
        res->nbr = 0;
        // compute address in subList table
        res->results = subList + (HUGE / thread_nbr) * thread_id;
        double *res_ptr = res->results; // local pointer. Each thread points to independent part of subList table

        #pragma omp for reduction(+:n)
        for (long i = 0; i < 1000000; ++i)
        {
            for (int i = 0; i < DELAY; ++i) {} //do nothing, just waste time
            if (hugeList[i] < 0)
            {
                //do something to fill "subList" properly
                res_ptr[n] = hugeList[i];
                n++;
            }
        }
        res->nbr = n;
    }

    for (int i = 0; i < 10; ++i)
    {
        printf("sublist %d: %lf\n", i, subList[i]); //show some elements of subList to check reproducibility
    }
    printf("n = %d\n", n);
}
Linux compile: gcc -o filename filename.c -fopenmp -lm
I hope there can be more discussion of the mechanism of this code.
I want to parallelize tasks inside a for loop using OpenMP. However, I do not want to use #pragma omp parallel for as the result of the (i+1)th iteration depends on the output of the (i)th iteration. I have tried to spawn the threads inside the code, but the time of creating and destroying them every time is very high. An abstract description of my code is:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
for (int i=0; i<1000; i++)
{
    a_new = fun(a_old); //fun() depends only on the value of the argument
    a_old = a_new;
    b_new = fun(b_old);
    b_old = b_new;
    c_new = fun(c_old);
    c_old = c_new;
    d_new = fun(d_old);
    d_old = d_new;
}
How can I efficiently use threads to calculate the new values of a_new, b_new, c_new, d_new in parallel in each iteration ?
Just don't parallelize the code inside the for loop - move the parallel region to the outside. This reduces the thread creation and worksharing overhead. Then you can easily apply OpenMP sections:
int a_old=1;
int b_old=1;
int c_old=1;
int d_old=1;
#pragma omp parallel sections
{
    #pragma omp section
    for (int i=0; i<1000; i++) {
        a_new = fun(a_old); //fun() depends only on the value of the argument
        a_old = a_new;
    }
    #pragma omp section
    for (int i=0; i<1000; i++) {
        b_new = fun(b_old);
        b_old = b_new;
    }
    #pragma omp section
    for (int i=0; i<1000; i++) {
        c_new = fun(c_old);
        c_old = c_new;
    }
    #pragma omp section
    for (int i=0; i<1000; i++) {
        d_new = fun(d_old);
        d_old = d_new;
    }
}
There is also another simplification:
int value[4] = {1, 1, 1, 1}; //same starting values as a_old..d_old
#pragma omp parallel for
for (int abcd = 0; abcd < 4; abcd++) {
    for (int i=0; i<1000; i++) {
        value[abcd] = fun(value[abcd]);
    }
}
In either case, you might want to consider adding padding between the values to avoid false sharing if fun executes rather quickly.
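A hedged sketch of that padding idea: give each value its own cache line so the four threads never write to the same line (64 bytes is an assumed cache-line size):
#define CACHE_LINE 64 //assumed cache-line size in bytes

struct padded_int {
    int value;
    char pad[CACHE_LINE - sizeof(int)]; //padding so each element occupies a full cache line
};

struct padded_int value[4] = { {1}, {1}, {1}, {1} };

#pragma omp parallel for
for (int abcd = 0; abcd < 4; abcd++) {
    for (int i = 0; i < 1000; i++) {
        value[abcd].value = fun(value[abcd].value);
    }
}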
This is pretty straightforward. As @kbr mentioned in the comments, each of the calculations a, b, c and d is independent, so you can separate them into different threads and pass the corresponding value as a parameter. The sample code looks like this.
#include <stdio.h>
#include <pthread.h>

void *thread_func(void *arg)
{
    int *i = (int *)arg;   // pthread start routines take a void* argument
    for (int j = 0; j < 1000; j++)
    {
        //Instead of incrementing, you can call whichever function you want here.
        (*i)++;
    }
    return NULL;
}

int main()
{
    int a_old = 1;
    int b_old = 1;
    int c_old = 1;
    int d_old = 1;
    pthread_t thread[4];

    pthread_create(&thread[0], 0, thread_func, &a_old);
    pthread_create(&thread[1], 0, thread_func, &b_old);
    pthread_create(&thread[2], 0, thread_func, &c_old);
    pthread_create(&thread[3], 0, thread_func, &d_old);

    // pthread_join takes the pthread_t by value, not a pointer to it
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    pthread_join(thread[2], NULL);
    pthread_join(thread[3], NULL);

    printf("a_old %d\n", a_old);
    printf("b_old %d\n", b_old);
    printf("c_old %d\n", c_old);
    printf("d_old %d\n", d_old);
}
I am writing a program that will match up one block (a group of 4 double numbers which are within a certain absolute value) with another.
Essentially, I will call the function in main.
The matrix has 4399 rows and 500 columns. I am trying to use OpenMP to speed up the task, yet my code seems to have a race condition within the innermost loop (where the actual creation of a block happens: create_Block(rrr[k], i);).
It is OK to ignore all the function details as they work well in the serial version. The only focus here is on the OpenMP directives.
int main(void) {
    readKey("keys.txt");
    double** jz = readMatrix("data.txt");
    int j = 0;
    int i = 0;
    int k = 0;

    #pragma omp parallel for firstprivate(i) shared(Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
    for (i = 0; i < 50; i++) {
        printf("THIS IS COLUMN %d\n", i);
        double* c = readCol(jz, i, 4400);

        #pragma omp parallel for firstprivate(j) shared(i,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
        for (j = 0; j < 4400; j++) {
            // printf("This is fixed row %d from column %d !!!!!!!!!!\n",j,i);
            int* one_collection = collection(c, j, 4400);
            // MODIFY THE DYMANIC ALLOCATION OF SPACES (SIZE_OF_COMBINATION) IN combNonRec() function.
            if (get_combination_size(SIZE_OF_COLLECTION, M) >= 4) {
                //GET THE 2D-ARRAY OF COMBINATION
                int** rrr = combNonRec(one_collection, SIZE_OF_COLLECTION, M);

                #pragma omp parallel for firstprivate(k) shared(i,j,Big_Block,NUM_OF_BLOCK,SIZE_OF_COLLECTION,b)
                for (k = 0; k < get_combination_size(SIZE_OF_COLLECTION, M); k++) {
                    create_Block(rrr[k], i); //ACTUAL CREATION OF BLOCK !!!!!!!
                    printf("This is block %d \n", NUM_OF_BLOCK);
                    add_To_Block_Collection();
                }
                free(rrr);
            }
            free(one_collection);
        }
        //OpenMP for j
        free(c);
    }
    // OpenMP for i
    collision();
}
The parallel version's result is non-deterministic, whereas the serial version consistently produces 400 blocks.
Big_Block, NUM_OF_BLOCK and SIZE_OF_COLLECTION are global variables.
Did I do anything wrong in the directive declarations? What might have caused this problem?