Calculate Pi algorithm from a tutorial using OpenMP - c

I am studying this tutorial about OpenMP and I came across this exercise, on page 19. It is a pi calculation algorithm which I have to parallelize:
static long num_steps = 100000;
double step;
void main ()
{
int i;
double x, pi
double sum = 0.0;
step = 1.0 / (double)num_steps;
for(i = 0; i < num_steps; i++)
{
x = (I + 0.5) * step;
sum = sum + 4.0 / (1.0 + x*x);
}
pi = step * sum;
}
I can not use, up to this point, #pragma parallel for. I can only use:
#pragma omp parallel {}
omp_get_thread_num();
omp_set_num_threads(int);
omp_get_num_threads();
My implementation looks like this :
#define NUM_STEPS 800
int main(int argc, char **argv)
{
int num_steps = NUM_STEPS;
int i;
double x;
double pi;
double step = 1.0 / (double)num_steps;
double sum[num_steps];
for(i = 0; i < num_steps; i++)
{
sum[i] = 0;
}
omp_set_num_threads(num_steps);
#pragma omp parallel
{
x = (omp_get_thread_num() + 0.5) * step;
sum[omp_get_thread_num()] += 4.0 / (1.0 + x * x);
}
double totalSum = 0;
for(i = 0; i < num_steps; i++)
{
totalSum += sum[i];
}
pi = step * totalSum;
printf("Pi: %.5f", pi);
}
Ignoring the problem by using an sum array (It explains later that it needs to define a critical section for the sum value with #pragma omp critical or #pragma omp atomic), the above impelentation only works for a limited number of threads (800 in my case), where the serial code uses 100000 steps. Is there a way to achieve this with only the aforementioned OpenMP commands, or am I obliged to use #pragma omp parallel for, which hasn't been mentioned yet in the tutorial?
Thanks a lot for your time, I am really trying to grasp the concept of parallelization in C using OpenMP.

You will need to find a way to make your parallel algorithm somewhat independent from the number of threads.
The most simple way is to do something like:
int tid = omp_get_thread_num();
int n_threads = omp_get_num_threads();
for (int i = tid; i < num_steps; i += n_threads) {
// ...
}
This way the work is split across all threads regardless of the number of threads.
If there were 3 threads and 9 steps:
Thread 0 would do steps 0, 3, 6
Thread 1 would do steps 1, 4, 7
Thread 2 would do steps 2, 5, 8
This works but isn't ideal if each thread is accessing data from some shared array. It is better if threads access sections of data nearby for locality purposes.
In that case you can divide the number of steps by the number of threads and give each thread a contiguous set of tasks like so:
int tid = omp_get_thread_num();
int n_threads = omp_get_num_threads();
int steps_per_thread = num_steps / n_threads;
int start = tid * steps_per_thread;
int end = start + steps_per_thread;
for (int i = start; i < end; i++) {
// ...
}
Now the 3 threads performing 9 steps looks like:
Thread 0 does steps 0, 1, 2
Thread 1 does steps 3, 4, 5
Thread 2 does steps 6, 7, 8
This approach is actually what is most likely happening when #pragma omp for is used. In most cases the compiler just divides the tasks according to the number of threads and assigns each thread a section.
So given a set of 2 threads and a 100 iteration for loop, the compiler would likely give iterations 0-49 to thread 0 and iterations 50-99 to thread 1.
Note that if the number of iterations does not divide evenly by the number of threads the remainder needs to be handled explicitly.

Related

Openmp pragma barrier for calculating pi in

I have a program in .C that uses openmp that can be seen below; the program is used to compute pi given a set of steps; however, I am new to openMp, so my knowledge is limited.
I'm attempting to implement a barrier for this program, but I believe one is already implicit, so I'm not sure if I even need to implement it.
Thank you!
#include <omp.h>
#include <stdio.h>
#define NUM_THREADS 4
static long num_steps = 100000000;
double step;
int main()
{
int i;
double start_time, run_time, pi, sum[NUM_THREADS];
omp_set_num_threads(NUM_THREADS);
step = 1.0 / (double)num_steps;
start_time = omp_get_wtime();
#pragma omp parallel
{
int i, id, currentThread;
double x;
id = omp_get_thread_num();
currentThread = omp_get_num_threads();
for (i = id, sum[id] = 0.0; i < num_steps; i = i + currentThread)
{
x = (i + 0.5) * step;
sum[id] = sum[id] + 4.0 / (1.0 + x * x);
}
}
run_time = omp_get_wtime() - start_time;
//we then get the value of pie
for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
{
pi = pi + sum[i] * step;
}
printf("\n pi with %ld steps is %lf \n ", num_steps, pi);
printf("run time = %6.6f seconds\n", run_time);
}
In your case there is no need for an explicit barrier, there is an implicit barrier at the end of the parallel section.
Your code, however, has a performance issue. Different threads update adjacent elements of sum array which can cause false sharing:
When multiple threads access same cache line and at least one of them
writes to it, it causes costly invalidation misses and upgrades.
To avoid it you have to be sure that each element of the sum array is located on a different cache line, but there is a simpler solution: to use OpenMP's reduction clause. Please check this example suggested by #JeromeRichard. Using reduction your code should be something like this:
double sum=0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < num_steps; i++)
{
const double x = (i + 0.5) * step;
sum += 4.0 / (1.0 + x * x);
}
Note also that you should use your variables in their minimum required scope.

Why OpenMP reduction is slower than MPI on share memory structure?

I have tried to test OpenMP and MPI parallel implementation for inner products of two vectors (element values are computed on the fly) and find out that OpenMP is slower than MPI.
The MPI code I am using is as following,
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>
int main(int argc, char* argv[])
{
double ttime = -omp_get_wtime();
int np, my_rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
int n = 10000;
int repeat = 10000;
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = my_rank * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double dot = 0;
double sum = 1;
int j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
MPI_Allreduce(&loc_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
sum += (dot/(double)(n));
}
time += omp_get_wtime();
if (my_rank == 0)
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
return 0;
}
I have tried several different implementation with OpenMP.
Here is the version which not to complicate and close to best performance I can achieve.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
int main(int argc, char* argv[])
{
int n = 10000;
int repeat = 10000;
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
int nstart =0;
int sublength =n;
double loc_dot = 0;
double sum = 1;
#pragma omp parallel
{
int i, j, k;
double time = -omp_get_wtime();
for (j = 0; j < repeat; j++)
{
#pragma omp for reduction(+: loc_dot)
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
#pragma omp single
{
sum += (loc_dot/(double)(n));
loc_dot =0;
}
}
time += omp_get_wtime();
#pragma omp single nowait
printf("sum = %f, time = %f sec, np = %d\n", sum, time, np);
}
return 0;
}
here is my test results:
OMP
sum = 6992.953984, time = 0.409850 sec, np = 1
sum = 6992.953984, time = 0.270875 sec, np = 2
sum = 6992.953984, time = 0.186024 sec, np = 4
sum = 6992.953984, time = 0.144010 sec, np = 8
sum = 6992.953984, time = 0.115188 sec, np = 16
sum = 6992.953984, time = 0.195485 sec, np = 32
MPI
sum = 6992.953984, time = 0.381701 sec, np = 1
sum = 6992.953984, time = 0.243513 sec, np = 2
sum = 6992.953984, time = 0.158326 sec, np = 4
sum = 6992.953984, time = 0.102489 sec, np = 8
sum = 6992.953984, time = 0.063975 sec, np = 16
sum = 6992.953984, time = 0.044748 sec, np = 32
Can anyone tell me what I am missing?
thanks!
update:
I have written an acceptable reduce function for OMP. the perfomance is close to MPI reduce function now. the code is as following.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
double darr[2][64];
int nreduce=0;
#pragma omp threadprivate(nreduce)
double OMP_Allreduce_dsum(double loc_dot,int tid,int np)
{
darr[nreduce][tid]=loc_dot;
#pragma omp barrier
double dsum =0;
int i;
for (i=0; i<np; i++)
{
dsum += darr[nreduce][i];
}
nreduce=1-nreduce;
return dsum;
}
int main(int argc, char* argv[])
{
int np = 1;
if (argc > 1)
{
np = atoi(argv[1]);
}
omp_set_num_threads(np);
double ttime = -omp_get_wtime();
int n = 10000;
int repeat = 10000;
#pragma omp parallel
{
int tid = omp_get_thread_num();
int sublength = (int)(ceil((double)(n) / (double)(np)));
int nstart = tid * sublength;
int nend = nstart + sublength;
if (nend >n )
{
nend = n;
sublength = nend - nstart;
}
double sum = 1;
double time = -omp_get_wtime();
int j, k;
for (j = 0; j < repeat; j++)
{
double loc_dot = 0;
for (k = 0; k < sublength; k++)
{
double temp = sin((sum+ nstart +k +j)/(double)(n));
loc_dot += (temp * temp);
}
double dot =OMP_Allreduce_dsum(loc_dot,tid,np);
sum +=(dot/(double)(n));
}
time += omp_get_wtime();
#pragma omp master
{
ttime += omp_get_wtime();
printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
}
}
return 0;
}
First of all, this code is very sensitive to synchronization overheads (both software and hardware) resulting in apparent strange behaviors themselves to both the OpenMP runtime implementation and low-level processor operations (eg. cache/bus effects). Indeed, a full synchronization is required for each iteration of the j-based loop executed every 45 ms. This means 4.5 us/iteration. In such a short time, the partial-sum spread in 32 cores needs to be reduced and broadcasted. If each core accumulates its own value in a shared atomic location, taking for example 60 ns per atomic add (realistic overhead for atomics on scalable Xeon processors), it would take 32 * 60 ns = 1.92 us since this process is done sequentially on x86 processors so far. This small additional time represent an overhead of 43% on the overall execution time because of the barriers! Due to contention on atomic variables, timings are often much worse. Moreover, the barrier themselves are expensive (they are often implemented using atomics in OpenMP runtimes but in a way that could scale a bit better).
The first OpenMP implementation was slow because implicit synchronizations and complex hardware cache effects. Indeed, the omp for reduction directive performs an implicit barrier at the end of its region as well as omp single. The reduction itself can implemented in several ways. The OpenMP runtime of ICC use a clever tree-based atomic implementation which should scale quite well (but not perfectly). Moreover, the omp single section will cause some cache-line bouncing. Indeed, the result loc_dot will likely be stored in the cache of the last core updating it while the thread executing this section will likely scheduled on another core. In this case, the processor has to move the cache-line from one L2 cache to another (or load the value from the L3 cache directly regarding the hardware state). The same thing also apply for sum (which tends to move between cores as the thread executing the section will likely not be always scheduled on the same core). Finally, the sum variable must be broadcasted on each core so they can start a new iteration.
The last OpenMP implementation is significantly better since every thread works on its own local data, it uses only one barrier (this synchronization is mandatory regarding the algorithm) and caches are better used. The accumulation part may not be ideal as all cores will likely fetch data previously located on all other L1/L2 caches causing a all-to-all broadcast pattern. This hardware-operation can scale barely but should be sequential either.
Note that the last OpenMP implementation suffer from false-sharing. Indeed, items of darr will be stored contiguously in memory and share the same cache-line. As a result, when a thread writes in darr, the associated core will request the cache-line and invalidates the ones located on others cores. This causes cache-line bouncing between cores. However, on current x86 processors, cache lines are 64 bytes wise and a double variable takes 8 bytes resulting in 8 items per cache-line. Thus, it mitigates the effect cache-line bouncing typically to 8 cores over the 32 ones. That being said, the item packing has some benefits as only 4 cache-lines fetch are required per core to perform the global accumulation. To prevent false-sharing, one can allocate a (8 times) bigger array and reserve some space between items so that 1 item is stored per cache-line. The best strategy on your target processor may to use a tree-based atomic reduction like the one the ICC OpenMP runtime use. Ideally, the sum reduction and the barrier can be merged together for better performance. This is what the MPI implementation can do internally (MPI_Allreduce).
Note that all implementations suffer from the very high thread synchronization. This is a problem as some context switch regularly occurs on some core because of some operating-system/hardware events (network, storage device, user, system processes, etc.). One critical issue is frequency-scaling on any modern x86 processors: not all core will work at the same frequency and their frequency change over time. The slowest thread will slow down all the others because of the barrier. In the worst case, some threads may passively wait enabling some cores to sleep (C-states) and then take more time to wake up slowing further down the others depending on the platform configuration.
The takeaway is:
the more synchronized a code is, the lower its scaling and the challenging its optimization.

Generate same random matrix in OpenMP than sequential code

I'd like to generate a random matrix with OpenMP like it were generated by a sequential program, i.e. if any sequential matrix generator outputs me a matrix like the following one:
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 0.0 1.0 2.0
3.0 4.0 5.0 6.0
I want the parallel OpenMP version of the same program to generate the same matrix with no interleaved rows.
Here is how I gradually approached the problem.
Given my serial generator C function generating a matrix as a 1D array:
void generate_matrix_array(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
First, I naively tried the #pragma omp parallel for directive to outer for loop; however, there's no guarantee about row ordering, since thread execution gets interleaved, so they get generated in a non-deterministic order.
Adding the ordered option would solve the issue at the price of making useless multithreading in this particular case.
In order to solve the issue, I tried to partition by hand the matrix array so that thread i would generate the i-th slice of it:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
#pragma omp parallel \
shared(v)
{
int tid = omp_get_thread_num();
int nthreads = omp_get_num_threads();
int rows_per_thread = round(rows / (double) nthreads);
int rem_rows = rows % (nthreads - 1) != 0?
rows % (nthreads - 1):
rows_per_thread;
int local_rows = (tid == 0)?
rows_per_thread:
rem_rows;
int lower_row = tid * local_rows;
int upper_row = ((tid + 1) * local_rows);
printf(
"[T%d] receiving %d of %d rows from row %d to %d\n",
tid,
local_rows,
rows,
lower_row,
upper_row - 1
);
printf("\n");
fflush(stdout);
for (int i = lower_row; i < upper_row; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
}
However, despite matrix vector gets properly divided among threads, for some reason unknown to me, every thread generates its rows into the matrix in a non-deterministic order, i.e. if I want to generate a 8x8 matrix with 4 threads and thread 3 is assigned to rows 4 and 5, he will generate two contiguous rows in the matrix array but in the wrong position every time, like if I didn't perform any partitioning and the omp parallel for directive was in place.
I skeptically tried, at last, to get back to naive approach by specifying shared(v) and schedule(static, 16) options to omp parallel for directive and it 'magically' happens to work:
void generate_matrix_array_par(
double *v,
int rows,
int columns,
double min,
double max,
int seed
) {
srand(seed);
int nthreads = omp_get_max_threads();
int chunk_size = (rows * columns) / nthreads;
#pragma omp parallel for \
shared(v) \
schedule(static, chunk_size)
for (int i = 0; i < rows; i++) {
for (int j = 0; j < columns; j++) {
v[i*rows + j] = min + (rand() / (RAND_MAX / (max - min)));
}
}
}
The schedule option is being added since I read somewhere else that it gets rid of cache conflicts. Edit: Looks like schedule splits up data to thread in a round-robin fashion according to a given chunk size; so if I share N/nthreads-sized chunks among threads, data will be assigned in a single round.
Any question? YES!!!
Now, I'd like to know whether I missed or failed some consideration about the problem, since I'm not convinced about the fairness of my last version of the program, despite the fact that it is working.

Using openmp to distribute matrix multiplication work across multiple GPUs via openacc using C

I am trying to distribute the work of multiplying two NxN matrices across 3 nVidia GPUs using 3 OpenMP threads. (The matrix values will get large hence the long long data type.) However I am having trouble placing the #pragma acc parallel loop in the correct place. I have used some examples in the nVidia PDFs shared but to no luck. I know that the inner most loop cannot be parallelized. But I would like each of the three threads to own a GPU and do a portion of the work. Note that input and output matrices are defined as global variables as I kept running out of stack memory.
I have tried the code below, but I get compilation errors all pointing to line 75 which is the #pragma acc parallel loop line
[test#server ~]pgcc -acc -mp -ta=tesla:cc60 -Minfo=all -o testGPU matrixMultiplyopenmp.c
PGC-S-0035-Syntax error: Recovery attempted by replacing keyword for by keyword barrier (matrixMultiplyopenmp.c: 75)
PGC-S-0035-Syntax error: Recovery attempted by replacing acc by keyword enum (matrixMultiplyopenmp.c: 76)
PGC-S-0036-Syntax error: Recovery attempted by inserting ';' before keyword for (matrixMultiplyopenmp.c: 77)
PGC/x86-64 Linux 18.10-1: compilation completed with severe errors
Function is:
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
// Get Nvidia device type
acc_init(acc_device_nvidia);
// Get Number of GPUs in system
int num_gpus = acc_get_num_devices(acc_device_nvidia);
//Set the number of OpenMP thread to the number of GPUs
#pragma omp parallel num_threads(num_gpus)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
acc_set_device_num(threadNum, acc_device_nvidia);
int row;
int col;
int key;
#pragma omp for
#pragma acc parallel loop
for (row = 0; row < SIZE; row++)
for (col = 0; col < SIZE; col++)
for (key = 0; key < SIZE; key++)
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}
}
As fisehara points out, you can't have both an OpenMP "for" loop combined with an OpenACC parallel loop on the same for loop. Instead, you need to manually decompose the work across the OpenMP threads. Example below.
Is there a reason why you want to use multiple GPUs here? Most likely the matrix multiply will fit on to a single GPU so there's no need for the extra overhead of introducing host-side parallelization.
Also, I generally recommend using MPI+OpenACC for multi-gpu programming. Domain decomposition is naturally part of MPI but not inherent in OpenMP. Also, MPI gives you a one-to-one relationship between the host process and accelerator, allows for scaling beyond a single node, and you can take advantage of CUDA Aware MPI for direct GPU to GPU data transfers. For more info, do a web search for "MPI OpenACC" and you'll find several tutorials. Class #2 at https://developer.nvidia.com/openacc-advanced-course is a good resource.
% cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#define SIZE 130
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
#ifdef _OPENACC
// Get Nvidia device type
acc_init(acc_device_nvidia);
// Get Number of GPUs in system
int num_gpus = acc_get_num_devices(acc_device_nvidia);
#else
int num_gpus = omp_get_max_threads();
#endif
if (SIZE<num_gpus) {
num_gpus=SIZE;
}
printf("Num Threads: %d\n",num_gpus);
//Set the number of OpenMP thread to the number of GPUs
#pragma omp parallel num_threads(num_gpus)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
#ifdef _OPENACC
acc_set_device_num(threadNum, acc_device_nvidia);
printf("THID %d using GPU: %d\n",threadNum,threadNum);
#endif
int row;
int col;
int key;
int start, end;
int block_size;
block_size = SIZE/num_gpus;
start = threadNum*block_size;
end = start+block_size;
if (threadNum==(num_gpus-1)) {
// add the residual to the last thread
end = SIZE;
}
printf("THID: %d, Start: %d End: %d\n",threadNum,start,end-1);
#pragma acc parallel loop \
copy(matrixProduct[start:end-start][:SIZE]), \
copyin(matrixA[start:end-start][:SIZE],matrixB[:SIZE][:SIZE])
for (row = start; row < end; row++) {
#pragma acc loop vector
for (col = 0; col < SIZE; col++) {
for (key = 0; key < SIZE; key++) {
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}}}
}
}
int main() {
long long int matrixA[SIZE][SIZE];
long long int matrixB[SIZE][SIZE];
long long int matrixProduct[SIZE][SIZE];
int i,j;
for(i=0;i<SIZE;++i) {
for(j=0;j<SIZE;++j) {
matrixA[i][j] = (i*SIZE)+j;
matrixB[i][j] = (j*SIZE)+i;
matrixProduct[i][j]=0;
}
}
multiplyMatrix(matrixA,matrixB,matrixProduct);
printf("Result:\n");
for(i=0;i<SIZE;++i) {
printf("%d: %ld %ld\n",i,matrixProduct[i][0],matrixProduct[i][SIZE-1]);
}
}
% pgcc test.c -mp -ta=tesla -Minfo=accel,mp
multiplyMatrix:
28, Parallel region activated
49, Generating copyin(matrixB[:130][:])
Generating copy(matrixProduct[start:end-start][:131])
Generating copyin(matrixA[start:end-start][:131])
Generating Tesla code
52, #pragma acc loop gang /* blockIdx.x */
54, #pragma acc loop vector(128) /* threadIdx.x */
55, #pragma acc loop seq
54, Loop is parallelizable
55, Complex loop carried dependence of matrixA->,matrixProduct->,matrixB-> prevents parallelization
Loop carried dependence of matrixProduct-> prevents parallelization
Loop carried backward dependence of matrixProduct-> prevents vectorization
59, Parallel region terminated
% a.out
Num Threads: 4
THID 0 using GPU: 0
THID: 0, Start: 0 End: 31
THID 1 using GPU: 1
THID: 1, Start: 32 End: 63
THID 3 using GPU: 3
THID: 3, Start: 96 End: 129
THID 2 using GPU: 2
THID: 2, Start: 64 End: 95
Result:
0: 723905 141340355
1: 1813955 425843405
2: 2904005 710346455
3: 3994055 994849505
...
126: 138070205 35988724655
127: 139160255 36273227705
128: 140250305 36557730755
129: 141340355 36842233805
I ran into an issue with MPI+OpenACC compilation on the shared system I was restricted to and could not upgrade the compiler. The solution I ended up using, was breaking the work down with OMP first then calling an OpenACC function as follows:
//Main code
pragma omp parallel num_threads(num_gpus)
{
#pragma omp for private(tid)
for (tid = 0; tid < num_gpus; tid++)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
acc_set_device_num(threadNum, acc_device_nvidia);
// check with thread is using which GPU
int gpu_num = acc_get_device_num(acc_device_nvidia);
printf("Thread # %d is going to use GPU # %d \n", threadNum, gpu_num);
//distribute the uneven rows
if (threadNum < extraRows)
{
startRow = threadNum * (rowsPerThread + 1);
stopRow = startRow + rowsPerThread;
}
else
{
startRow = threadNum * rowsPerThread + extraRows;
stopRow = startRow + (rowsPerThread - 1);
}
// Debug to check allocation of data to threads
//printf("Start row is %d, and Stop rows is %d \n", startRow, stopRow);
GPUmultiplyMatrix(matrixA, matrixB, matrixProduct, startRow, stopRow);
}
}
void GPUmultiplyMatrix(long long int matrixA[SIZE][SIZE], long long int
matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE], int
startRow, int stopRow)
{
int row;
int col;
int key;
#pragma acc parallel loop collapse (2)
for (row = startRow; row <= stopRow; row++)
for (col = 0; col < SIZE; col++)
for (key = 0; key < SIZE; key++)
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}

Multithreaded program outputs different results every time it runs

I have been trying to create a Multithreaded program that calculates the multiples of 3 and 5 from 1 to 999 but I can't seem to get it right every time I run it I get a different value I think it might have to do with the fact that I use a shared variable with 10 threads but I have no idea how to get around that. Also The program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>
#define NUM_THREADS 10
#define MAX 1000
//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
omp_set_num_threads(10);//set number of threads to be used in the parallel loop
unsigned int NUMS[1000] = { 0 };
int j = 0;
#pragma omp parallel
{
int ID = omp_get_thread_num();//get thread ID
int i;
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
}
int i = 0;
unsigned int total;
for(i = 0; NUMS[i] != 0; i++)total += NUMS[i];//add up multiples of 3 and 5
printf("Total : %d\n", total);
return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use thread variables to get around that.
In your code j is a shared inductive variable. You can't rely on using shared inductive variables efficiently with multiple threads (using atomic every iteration is not efficient).
You could find a special solution not using inductive variables (for example using wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15) or you could find a general solution using private inductive variables like this
#pragma omp parallel
{
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
#pragma omp for schedule(static) ordered
for(i=0; i<omp_get_num_threads(); i++) {
#pragma omp ordered
{
memcpy(&NUMS[j], NUMS_local, sizeof *NUMS *k);
j += k;
}
}
}
This solution does not make optimal use of memory however. A better solution would use something like std::vector from C++ which you could implement for example using realloc in C but I'm not going to do that for you.
Edit:
Here is a special solution which does not use shared inductive variables using wheel factorization
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
for(int k=0; k<7; k++) {
NUMS[7*i + k] = 7*i + wheel[k];
total += NUMS[7*i + k];
}
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS[j++] = i;
total += i;
}
}
Edit: It's possible to do this without a critical section (from the ordered clause). This does memcpy in parallel and also makes better use of memory at least for the shared array.
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
int i;
int nthreads = omp_get_num_threads();
int ithread = omp_get_thread_num();
#pragma omp single
{
prefix = malloc(sizeof *prefix * (nthreads+1));
prefix[0] = 0;
}
int k = 0;
unsigned int NUMS_local[MAX] = {0};
#pragma omp for schedule(static) nowait reduction(+:total)
for(i=0; i<MAX; i++) {
if(i%5==0 || i%3==0) {
NUMS_local[k++] = i;
total += i;
}
}
prefix[ithread+1] = k;
#pragma omp barrier
#pragma omp single
{
for(i=1; i<nthreads+1; i++) prefix[i+1] += prefix[i];
NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
j = prefix[nthreads];
}
memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS *k);
}
free(prefix);
This is a typical thread synchronization issue. All you need to do is using a kernel synchronization object for the sake of atomicity of any desired operation (incrementing the value of variable j in your case). It would be a mutex, semaphore or an event object depending on the operating system you're working on. But whatever your development environment is, to provide atomicity, the fundamental flow logic should be like the following pseudo-code:
{
lock(kernel_object)
// ...
// do your critical operation (increment your variable j in your case)
// ++j;
// ...
unlock(kernel_object)
}
If you're working on Windows operating system, there are some special synchronization mechanisms provided by the environment (i.e: InterlockedIncrement or CreateCriticalSection etc.) If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. Actually all those synchronization mechanism are stem from the concept of semaphores which is invented by Edsger W. Dijkstra in the begining of 1960's.
Here's some basic examples below:
Linux
#include <pthread.h>
pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;
int main(int argc, char* argv[])
{
// ...
pthread_mutex_lock(&g_mutexObject);
++j; // incrementing j atomically
pthread_mutex_unlock(&g_mutexObject);
// ...
pthread_mutex_destroy(&g_mutexObject);
// ...
exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>
CRITICAL_SECTION g_csObject;
int main(void)
{
// ...
InitializeCriticalSection(&g_csObject);
// ...
EnterCriticalSection(&g_csObject);
++j; // incrementing j atomically
LeaveCriticalSection(&g_csObject);
// ...
DeleteCriticalSection(&g_csObject);
// ...
exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>
LONG volatile g_j; // our little j must be volatile in here now
int main(void)
{
// ...
InterlockedIncrement(&g_j); // incrementing j atomically
// ...
exit(EXIT_SUCCESS);
}
The problem you have is that threads doesn't necesarlly execute in order so the last thread to wirete may not have read the value in order so you overwrite wrong data.
There is a form to set that the threads in a loop, do a sumatory when they finish with the openmp options. You have to wirte somthing like this to use it.
#pragma omp parallel for reduction(+:sum)
for(k=0;k<num;k++)
{
sum = sum + A[k]*B[k];
}
/* Fin del computo */
gettimeofday(&fin,NULL);
all you have to do is write the result in "sum", this is from an old code i have that do a sumatory.
The other option you have is the dirty one. Someway, make the threads wait and get in order using a call to the OS. This is easier than it looks. This will be a solution.
#pragma omp parallel
for(i = ID + 1;i < MAX; i+= NUM_THREADS)
{
printf("asdasdasdasdasdasdasdas");
if( i % 5 == 0 || i % 3 == 0)
{
NUMS[j++] = i;//Store Multiples of 3 and 5 in an array to sum up later
}
}
but i recommendo you to read fully the openmp options.

Resources