Parallel for inside a while loop using OpenMP in C

I'm trying to do a parallel for inside a while loop, something like this:
while(!End){
    for(...;...;...)   // the parallel for
        ...
    // serial code
}
The for loop is the only parallel section of the while loop. If I do this, I have a lot of overhead:
cycles = 0;
while(!End){   // ~1 billion iterations approx.
    #pragma omp parallel for
    for(i=0;i<N;i++)   // the parallel for, with ~256 iterations
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    ++cycles;
}
Each iteration of the for loop is independent of the others.
There are dependencies between the serial code and the parallel code.

So normally one doesn't have to worry too much about putting parallel regions inside loops: modern OpenMP implementations are pretty efficient about reusing thread teams, and as long as there's lots of work in the loop you're fine. But here, with an outer loop count of ~1e9 and an inner loop count of ~256, and very little work being done per iteration, the overhead is likely comparable to or worse than the amount of work being done, and performance will suffer.
So there will be a noticeable difference between this:
cycles = 0;
while(!End){   // ~1 billion iterations approx.
    #pragma omp parallel for
    for(i=0;i<N;i++)   // the parallel for, with ~256 iterations
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    ++cycles;
}
and this:
cycles = 0;
#pragma omp parallel
while(!End){   // ~1 billion iterations approx.
    #pragma omp for
    for(i=0;i<N;i++)   // the parallel for, with ~256 iterations
        if(time[i] == cycles){
            if (wbusy[i]){
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
        }
    // serial code
    #pragma omp single
    {
        ++cycles;
    }
}
But really, that scan across the time array every iteration is unfortunately both (a) slow and (b) not enough work to keep multiple cores busy: it's memory intensive. With more than a couple of threads you will actually get worse performance than serial, even without the overheads, just because of memory contention. Admittedly what you have posted here is just an example, not your real code, but why not preprocess the time array so that you can just check when the next task is ready to update:
#include <stdio.h>
#include <stdlib.h>

struct tasktime_t {
    long int time;
    int task;
};

int stime_compare(const void *a, const void *b) {
    return ((const struct tasktime_t *)a)->time - ((const struct tasktime_t *)b)->time;
}

int main(int argc, char **argv) {
    const int n = 256;
    const long int niters = 100000000l;
    long int time[n];
    int wbusy[n];
    int wfinished[n];

    for (int i=0; i<n; i++) {
        time[i] = rand() % niters;
        wbusy[i] = 1;
        wfinished[i] = 0;
    }

    struct tasktime_t stimes[n];
    for (int i=0; i<n; i++) {
        stimes[i].time = time[i];
        stimes[i].task = i;
    }

    qsort(stimes, n, sizeof(struct tasktime_t), stime_compare);

    long int cycles = 0;
    int next = 0;
    while (cycles < niters) {   // niters iterations of the outer loop
        while ( (next < n) && (stimes[next].time == cycles) ) {
            int i = stimes[next].task;
            if (wbusy[i]) {
                wbusy[i] = 0;
                wfinished[i] = 1;
            }
            next++;
        }
        ++cycles;
    }
    return 0;
}
This is ~5 times faster than the serial version of the scanning approach (and much faster than the OpenMP versions). Even if you are constantly updating the time/wbusy/wfinished arrays in the serial code, you can keep track of the completion times with a priority queue, with each update taking O(log N) time instead of a full O(N) scan every iteration.
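For illustration, here is a minimal sketch of what such a priority queue could look like in C: a binary min-heap keyed by completion time. The struct name, fixed capacity, and helper names are my own choices for the sketch, not code from the question; the usage loop at the bottom reuses End, cycles, wbusy, and wfinished from the question's code.
struct event_t { long time; int task; };

static struct event_t heap[4096];   /* capacity is an arbitrary choice for this sketch */
static int heap_size = 0;

static void heap_push(long t, int task) {
    int i = heap_size++;
    heap[i].time = t;
    heap[i].task = task;
    while (i > 0 && heap[(i-1)/2].time > heap[i].time) {   /* sift up */
        struct event_t tmp = heap[i];
        heap[i] = heap[(i-1)/2];
        heap[(i-1)/2] = tmp;
        i = (i-1)/2;
    }
}

static struct event_t heap_pop(void) {
    struct event_t top = heap[0];
    heap[0] = heap[--heap_size];
    int i = 0;
    for (;;) {                                             /* sift down */
        int l = 2*i+1, r = 2*i+2, m = i;
        if (l < heap_size && heap[l].time < heap[m].time) m = l;
        if (r < heap_size && heap[r].time < heap[m].time) m = r;
        if (m == i) break;
        struct event_t tmp = heap[i];
        heap[i] = heap[m];
        heap[m] = tmp;
        i = m;
    }
    return top;
}

/* Main simulation loop, replacing the scan over time[]: */
while (!End) {
    while (heap_size > 0 && heap[0].time == cycles) {
        int i = heap_pop().task;
        if (wbusy[i]) { wbusy[i] = 0; wfinished[i] = 1; }
    }
    // serial code: whenever it sets time[i] to a new value, it also
    // calls heap_push(time[i], i) so the event is found later in O(log N)
    ++cycles;
}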

Related

Parallel for is less efficient than serial?

I've been working on a small project for my college, with C and OpenMP. To make it short: when I try to parallelize a for loop using the #pragma omp parallel for construct, it ends up being way slower than the serial version, just by adding that. It is a parallel version of odd-even sort that works with an array of integers.
I found that it has something to do with the threads accessing the memory location of the whole array each time they compare numbers, and updating their own copy of it in cache memory. But I don't know how to fix it so that, rather than updating the whole array, they just access the exact locations of the integers they are comparing. I'm fairly new to OpenMP, so I don't know if there's a clause or construct for this kind of situation.
//version without parallel for
void bubbleSortParalelo(int array[], int size) {
    int i, j, first;
    for (i = 0; i < size; i++) {
        first = i % 2;
        for (j = first; j < size-1; j += 2) {
            if (array[j] > array[j+1]) {
                int temp = array[j+1];
                array[j+1] = array[j];
                array[j] = temp;
            }
        }
    }
}
//Version with parallel for, takes longer somehow
void bubbleSortParalelo2(int array[], int size) {
    int i, j, first;
    for (i = 0; i < size; i++) {
        first = i % 2;
        #pragma omp parallel for
        for (j = first; j < size-1; j += 2) {
            if (array[j] > array[j+1]) {
                int temp = array[j+1];
                array[j+1] = array[j];
                array[j] = temp;
            }
        }
    }
}
I want to make the parallel version at least as efficient as the serial one, because right now it takes about 10 times longer, and it gets worse the more threads I use.
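In the spirit of the answer above (create the parallel region once, not inside the hot loop), here is a minimal sketch of the same odd-even sort with the thread team hoisted out of the phase loop. This is my own illustration, not the asker's code; the swaps are memory-bound, so the speedup may still be modest for small arrays.
void bubbleSortParalelo3(int array[], int size) {
    #pragma omp parallel
    {
        for (int i = 0; i < size; i++) {
            int first = i % 2;
            // The worksharing 'for' splits each phase across the existing team;
            // its implicit barrier keeps the odd/even phases in order.
            #pragma omp for
            for (int j = first; j < size - 1; j += 2) {
                if (array[j] > array[j+1]) {
                    int temp = array[j+1];
                    array[j+1] = array[j];
                    array[j] = temp;
                }
            }
        }
    }
}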

Why is OpenMP reduction slower than MPI on shared memory?

I have tried to test OpenMP and MPI parallel implementations for the inner product of two vectors (element values are computed on the fly) and found that OpenMP is slower than MPI.
The MPI code I am using is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    double ttime = -omp_get_wtime();
    int np, my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    int n = 10000;
    int repeat = 10000;

    int sublength = (int)(ceil((double)(n) / (double)(np)));
    int nstart = my_rank * sublength;
    int nend   = nstart + sublength;
    if (nend > n)
    {
        nend = n;
        sublength = nend - nstart;
    }

    double dot = 0;
    double sum = 1;
    int j, k;

    double time = -omp_get_wtime();
    for (j = 0; j < repeat; j++)
    {
        double loc_dot = 0;
        for (k = 0; k < sublength; k++)
        {
            double temp = sin((sum + nstart + k + j) / (double)(n));
            loc_dot += (temp * temp);
        }
        MPI_Allreduce(&loc_dot, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        sum += (dot / (double)(n));
    }
    time += omp_get_wtime();

    if (my_rank == 0)
    {
        ttime += omp_get_wtime();
        printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
    }
    MPI_Finalize();
    return 0;
}
I have tried several different implementations with OpenMP.
Here is the version that is not too complicated and is close to the best performance I can achieve:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    int n = 10000;
    int repeat = 10000;

    int np = 1;
    if (argc > 1)
    {
        np = atoi(argv[1]);
    }
    omp_set_num_threads(np);

    int nstart = 0;
    int sublength = n;

    double loc_dot = 0;
    double sum = 1;
    #pragma omp parallel
    {
        int i, j, k;
        double time = -omp_get_wtime();
        for (j = 0; j < repeat; j++)
        {
            #pragma omp for reduction(+: loc_dot)
            for (k = 0; k < sublength; k++)
            {
                double temp = sin((sum + nstart + k + j) / (double)(n));
                loc_dot += (temp * temp);
            }
            #pragma omp single
            {
                sum += (loc_dot / (double)(n));
                loc_dot = 0;
            }
        }
        time += omp_get_wtime();
        #pragma omp single nowait
        printf("sum = %f, time = %f sec, np = %d\n", sum, time, np);
    }
    return 0;
}
Here are my test results:
OMP
sum = 6992.953984, time = 0.409850 sec, np = 1
sum = 6992.953984, time = 0.270875 sec, np = 2
sum = 6992.953984, time = 0.186024 sec, np = 4
sum = 6992.953984, time = 0.144010 sec, np = 8
sum = 6992.953984, time = 0.115188 sec, np = 16
sum = 6992.953984, time = 0.195485 sec, np = 32
MPI
sum = 6992.953984, time = 0.381701 sec, np = 1
sum = 6992.953984, time = 0.243513 sec, np = 2
sum = 6992.953984, time = 0.158326 sec, np = 4
sum = 6992.953984, time = 0.102489 sec, np = 8
sum = 6992.953984, time = 0.063975 sec, np = 16
sum = 6992.953984, time = 0.044748 sec, np = 32
Can anyone tell me what I am missing?
thanks!
Update:
I have written an acceptable reduce function for OpenMP. The performance is close to the MPI reduce function now. The code is as follows:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>

double darr[2][64];
int nreduce = 0;
#pragma omp threadprivate(nreduce)

double OMP_Allreduce_dsum(double loc_dot, int tid, int np)
{
    darr[nreduce][tid] = loc_dot;
    #pragma omp barrier
    double dsum = 0;
    int i;
    for (i = 0; i < np; i++)
    {
        dsum += darr[nreduce][i];
    }
    nreduce = 1 - nreduce;
    return dsum;
}

int main(int argc, char* argv[])
{
    int np = 1;
    if (argc > 1)
    {
        np = atoi(argv[1]);
    }
    omp_set_num_threads(np);
    double ttime = -omp_get_wtime();

    int n = 10000;
    int repeat = 10000;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        int sublength = (int)(ceil((double)(n) / (double)(np)));
        int nstart = tid * sublength;
        int nend   = nstart + sublength;
        if (nend > n)
        {
            nend = n;
            sublength = nend - nstart;
        }

        double sum = 1;
        double time = -omp_get_wtime();

        int j, k;
        for (j = 0; j < repeat; j++)
        {
            double loc_dot = 0;
            for (k = 0; k < sublength; k++)
            {
                double temp = sin((sum + nstart + k + j) / (double)(n));
                loc_dot += (temp * temp);
            }
            double dot = OMP_Allreduce_dsum(loc_dot, tid, np);
            sum += (dot / (double)(n));
        }
        time += omp_get_wtime();

        #pragma omp master
        {
            ttime += omp_get_wtime();
            printf("np = %d sum = %f, loop time = %f sec, total time = %f \n", np, sum, time, ttime);
        }
    }
    return 0;
}
First of all, this code is very sensitive to synchronization overheads (both software and hardware), resulting in apparently strange behaviors that depend on the OpenMP runtime implementation and on low-level processor operations (e.g. cache/bus effects). Indeed, a full synchronization is required for each iteration of the j-based loop, and the whole loop runs in about 45 ms, which means about 4.5 us per iteration. In such a short time, the partial sums spread over 32 cores need to be reduced and broadcast. If each core accumulates its own value into a shared atomic location, taking for example 60 ns per atomic add (a realistic overhead for atomics on Xeon Scalable processors), it would take 32 * 60 ns = 1.92 us, since this process is done sequentially on x86 processors so far. This small additional time represents an overhead of 43% on the overall execution time, just because of the barriers! Due to contention on atomic variables, timings are often much worse. Moreover, the barriers themselves are expensive (they are often implemented using atomics in OpenMP runtimes, though in a way that can scale a bit better).
The first OpenMP implementation was slow because of implicit synchronizations and complex hardware cache effects. Indeed, the omp for reduction directive performs an implicit barrier at the end of its region, as does omp single. The reduction itself can be implemented in several ways. The OpenMP runtime of ICC uses a clever tree-based atomic implementation, which should scale quite well (but not perfectly). Moreover, the omp single section will cause some cache-line bouncing. Indeed, the result loc_dot will likely be stored in the cache of the last core that updated it, while the thread executing this section will likely be scheduled on another core. In this case, the processor has to move the cache line from one L2 cache to another (or load the value directly from the L3 cache, depending on the hardware state). The same thing also applies to sum (which tends to move between cores, as the thread executing the section will likely not always be scheduled on the same core). Finally, the sum variable must be broadcast to every core so that they can start a new iteration.
The last OpenMP implementation is significantly better since every thread works on its own local data, it uses only one barrier (this synchronization is mandatory for the algorithm), and caches are better used. The accumulation part may not be ideal, as all cores will likely fetch data previously located in all the other L1/L2 caches, causing an all-to-all broadcast pattern. This hardware operation scales poorly, but at least it is not sequential.
Note that the last OpenMP implementation suffers from false sharing. Indeed, the items of darr are stored contiguously in memory and share cache lines. As a result, when a thread writes into darr, the associated core requests the cache line and invalidates the copies held by other cores. This causes cache-line bouncing between cores. However, on current x86 processors cache lines are 64 bytes wide and a double takes 8 bytes, resulting in 8 items per cache line. Thus, the cache-line bouncing typically involves groups of 8 of the 32 cores, which mitigates the effect. That being said, the packed layout has some benefit, as only 4 cache-line fetches per core are required to perform the global accumulation. To prevent false sharing, one can allocate an (8 times) bigger array and leave some space between items so that only 1 item is stored per cache line. The best strategy on your target processor may be to use a tree-based atomic reduction like the one the ICC OpenMP runtime uses. Ideally, the sum reduction and the barrier could be merged together for better performance. This is what the MPI implementation can do internally (MPI_Allreduce).
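As an illustration of the padding idea, here is a sketch of a padded variant of the OMP_Allreduce_dsum function above. It is my own sketch, not code from the question: it reuses the threadprivate nreduce double-buffer index from the code above, and the 64-byte line size is an assumption about current x86 CPUs. Each thread's slot gets its own cache line, so a partial-sum store no longer invalidates the neighbours' lines.
#define CACHE_LINE 64

struct padded_double {
    double value;
    char pad[CACHE_LINE - sizeof(double)];   /* one slot per 64-byte cache line */
};

static struct padded_double darr_padded[2][64];

double OMP_Allreduce_dsum_padded(double loc_dot, int tid, int np)
{
    darr_padded[nreduce][tid].value = loc_dot;   /* no false sharing on this store */
    #pragma omp barrier
    double dsum = 0;
    for (int i = 0; i < np; i++)
        dsum += darr_padded[nreduce][i].value;
    nreduce = 1 - nreduce;                       /* same double-buffering as above */
    return dsum;
}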
Note that all of the implementations suffer from the very heavy thread synchronization. This is a problem because context switches regularly occur on some cores due to operating-system/hardware events (network, storage devices, users, system processes, etc.). One critical issue is frequency scaling on any modern x86 processor: not all cores run at the same frequency, and their frequencies change over time. The slowest thread slows down all the others because of the barrier. In the worst case, some threads may wait passively, allowing some cores to sleep (C-states) and then take more time to wake up, slowing the others down further, depending on the platform configuration.
The takeaway is:
the more synchronized a code is, the lower its scaling and the more challenging its optimization.

Mixing OpenMP and xmmintrin SSE Intrinsics - not getting speedup over the non-parallel version

I've implemented a version of the Travelling Salesman problem with xmmintrin.h SSE instructions and got a decent speedup. But now I'm also trying to implement OpenMP threading on top of it, and I'm seeing a pretty drastic slowdown. I'm getting the correct answer in both cases (i.e. (i) with SSE only, or (ii) with SSE && OpenMP).
I know I am probably doing something wildly wrong, and maybe someone much more experienced than me can spot the issue.
The main loop of my program has the following (brief) pseudocode:
int currentNode;
for(int i = 0; i < numNodes; i++) {
    minimumDistance = DBL_MAX;
    minimumDistanceNode;
    for(int j = 0; j < numNodes; j++) {
        // find distance between 'currentNode' and the j-th node
        // ...
        if(jthNodeDistance < minimumDistance) {
            minimumDistance = jthNodeDistance;
            minimumDistanceNode = jthNode;
        }
    }
    currentNode = minimumDistanceNode;
}
And here is my implementation. It is still semi-pseudocode, as I've brushed over some parts that I don't think have an impact on performance; I think the issues with my code can be found in the following snippet. If you just omit the #pragma lines, it is pretty much identical to the SSE-only version of the same program, so I figure I should only include the OpenMP version:
int currentNode = 0;
#pragma omp parallel
{
    #pragma omp single
    {
        for (int i = 1; i < totalNum; i++) {
            minimum = DBL_MAX;
            __m128 currentNodeX = _mm_set1_ps(xCoordinates[currentNode]);
            __m128 currentNodeY = _mm_set1_ps(yCoordinates[currentNode]);

            #pragma omp parallel num_threads(omp_get_max_threads())
            {
                float localMinimum = DBL_MAX;
                float localMinimumNode;

                #pragma omp for
                for (int j = 0; j < loopEnd; j += 4) {
                    // a number of SSE vector calculations to find the distance
                    // between the current node and the four nodes we're looking
                    // at in this iteration of the loop:
                    __m128 subXs_0 = _mm_sub_ps(currentNodeX, _mm_load_ps(&xCoordinates[j]));
                    __m128 squareSubXs_0 = _mm_mul_ps(subXs_0, subXs_0);
                    __m128 subYs_0 = _mm_sub_ps(currentNodeY, _mm_load_ps(&yCoordinates[j]));
                    __m128 squareSubYs_0 = _mm_mul_ps(subYs_0, subYs_0);
                    __m128 addXY_0 = _mm_add_ps(squareSubXs_0, squareSubYs_0);

                    float temp[unroll];
                    _mm_store_ps(&temp[0], addXY_0);

                    // skipping stuff here that is about getting the minimum distance and
                    // its equivalent node; don't think it's massively relevant, but
                    // each thread will have its own
                    // localMinimum
                    // localMinimumNode
                }

                // updating the global minimumNode in a thread-safe way
                #pragma omp critical (update_minimum)
                {
                    if (localMinimum < minimum) {
                        minimum = localMinimum;
                        minimumNode = localMinimumNode;
                    }
                }
            }

            // within the 'omp single'
            ThisPt = minimumNode;
        }
    }
}
So my logic is:
omp single for the top-level for(int i) loop, as I only want one thread dedicated to it;
omp parallel num_threads(omp_get_max_threads()) for the inner for(int j) loop, as I want all cores working on this part of the code at the same time;
omp critical at the end of the full for(int j) loop, as I want to update the current node thread-safely.
In terms of run-time, the OpenMP version is typically twice as slow as the SSE-only version.
Does anything jump out at you as particularly bad in my code, that is causing this drastic slow-down for OpenMP?
Does anything jump out at you as particularly bad in my code, that is
causing this drastic slow-down for OpenMP?
First:
omp single for the top-level for(int i) for loop, and I only want 1
thread dedicated to this
In your code you have the following:
#pragma omp parallel
{
#pragma omp single
{
for (int i = 1; i < totalNum; i++)
{
#pragma omp parallel num_threads(omp_get_max_threads())
{
//....
}
// within the 'omp single'
ThisPt = minimumNode;
}
}
}
The #pragma omp parallel creates a team of threads, but then only one thread executes the parallel task (i.e., the #pragma omp single region) while the other threads don't do anything. You can simplify this to:
for (int i = 1; i < totalNum; i++)
{
    #pragma omp parallel num_threads(omp_get_max_threads())
    {
        //....
    }
    ThisPt = minimumNode;
}
The outer loop is still executed by only one thread, just as before.
Second:
omp parallel num_threads(omp_get_max_threads()) for the inner for(int
j) for-loop, as I want all cores working on this part of the code at
the same time.
The problem is that omp_get_max_threads() might return the number of logical cores rather than physical cores, and some codes perform worse with hyper-threading. So I would first test with different numbers of threads, starting from 2, 4, and so on, until you find a count beyond which the code stops scaling.
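A quick way to run that experiment is sketched below. This is my own illustration, not code from the question: kernel() is a hypothetical stand-in for the inner for(int j) search, and the sweep doubles the thread count up to the runtime's maximum while timing each run.
#include <stdio.h>
#include <omp.h>

/* Placeholder for the real SSE distance search. */
static double kernel(int n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (int j = 0; j < n; j++)
        s += j * 0.5;
    return s;
}

int main(void) {
    const int n = 1 << 20;
    int maxt = omp_get_max_threads();        /* capture before changing it */
    for (int nt = 1; nt <= maxt; nt *= 2) {
        omp_set_num_threads(nt);
        double t = omp_get_wtime();
        double s = kernel(n);
        t = omp_get_wtime() - t;
        printf("threads=%2d  time=%.4f s  (checksum %.1f)\n", nt, t, s);
    }
    return 0;
}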
omp critical at the end of the full for(int j) loop, as I want to
thread-safely update the current node.
// updating the global minimumNode in a thread-safe way
#pragma omp critical (update_minimum)
{
    if (localMinimum < minimum) {
        minimum = localMinimum;
        minimumNode = localMinimumNode;
    }
}
This can be replaced by creating an array in which each thread saves its local minimum at a position reserved for that thread; outside the parallel region, the initial thread then extracts the minimum and minimumNode:
int total_threads = /* ... */;
float localMinimum[total_threads];
float localMinimumNode[total_threads];
for (int i = 0; i < total_threads; i++)
    localMinimum[i] = FLT_MAX;   /* from <float.h>; a VLA cannot be initialized with = {...} */

#pragma omp parallel num_threads(total_threads)
{
    /* ... each thread fills its own slot ... */
}

for (int i = 0; i < total_threads; i++) {
    if (localMinimum[i] < minimum) {
        minimum = localMinimum[i];
        minimumNode = localMinimumNode[i];
    }
}
Finally, after those changes are done, try to check whether it is possible to replace this parallelization with the following:
#pragma omp parallel for
for (int i = 1; i < totalNum; i++)
{
...
}

Multithreaded program outputs different results every time it runs

I have been trying to create a multithreaded program that calculates the multiples of 3 and 5 from 1 to 999, but I can't seem to get it right: every time I run it I get a different value. I think it might have to do with the fact that I use a shared variable with 10 threads, but I have no idea how to get around that. Also, the program does work if I calculate the multiples of 3 and 5 from 1 to 9.
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <string.h>

#define NUM_THREADS 10
#define MAX 1000

//finds multiples of 3 and 5 and sums up all of the multiples
int main(int argc, char ** argv)
{
    omp_set_num_threads(10);//set number of threads to be used in the parallel loop
    unsigned int NUMS[1000] = { 0 };
    int j = 0;

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();//get thread ID
        int i;
        for(i = ID + 1; i < MAX; i += NUM_THREADS)
        {
            if( i % 5 == 0 || i % 3 == 0)
            {
                NUMS[j++] = i;//Store multiples of 3 and 5 in an array to sum up later
            }
        }
    }

    int i = 0;
    unsigned int total;
    for(i = 0; NUMS[i] != 0; i++) total += NUMS[i];//add up multiples of 3 and 5
    printf("Total : %d\n", total);
    return 0;
}
"j++" is not an atomic operation.
It means "take the value contained at the storage location called j, use it in the current statement, add one to it, then store it back in the same location it came from".
(That's the simple answer. Optimization and whether or not the value is kept in a register can and will change things even more.)
When you have multiple threads doing that to the same variable all at the same time, you get different and unpredictable results.
You can use per-thread (private) variables to get around that.
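For reference, the smallest change that removes just this race is an atomic capture of the index. This is my own sketch, not code from the question, and (as the next answer explains) doing an atomic update on every iteration is not an efficient approach; it only illustrates what "j++ is not atomic" means in OpenMP terms.
int idx;
#pragma omp atomic capture
idx = j++;        // read the old value of j and increment it in one atomic step
NUMS[idx] = i;    // each thread now writes to its own unique slot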
In your code, j is a shared induction variable. You can't rely on using shared induction variables efficiently with multiple threads (using atomic on every iteration is not efficient).
You could find a special solution that does not use induction variables (for example, wheel factorization with seven spokes {0,3,5,6,9,10,12} out of 15), or you could find a general solution using private induction variables like this:
#pragma omp parallel
{
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    #pragma omp for schedule(static) ordered
    for(i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        {
            memcpy(&NUMS[j], NUMS_local, sizeof *NUMS * k);
            j += k;
        }
    }
}
This solution does not make optimal use of memory, however. A better solution would use something like std::vector from C++, which you could implement in C using realloc, but I'm not going to do that for you.
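For completeness, here is a minimal sketch of such a growable array in C. It is my own illustration of the realloc idea, not code from the answer: each thread would keep one of these instead of the fixed-size NUMS_local buffer.
#include <stdlib.h>

typedef struct {
    unsigned int *data;
    size_t size, capacity;
} uintvec;

/* Append x, doubling the capacity with realloc when the buffer is full.
   Returns 0 on success, -1 on allocation failure. */
static int uintvec_push(uintvec *v, unsigned int x) {
    if (v->size == v->capacity) {
        size_t newcap = v->capacity ? 2 * v->capacity : 16;
        unsigned int *p = realloc(v->data, newcap * sizeof *p);
        if (!p) return -1;
        v->data = p;
        v->capacity = newcap;
    }
    v->data[v->size++] = x;
    return 0;
}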
Edit:
Here is a special solution which does not use shared induction variables, using wheel factorization:
int wheel[] = {0,3,5,6,9,10,12};
int n = MAX/15;
#pragma omp parallel for reduction(+:total)
for(int i=0; i<n; i++) {
    for(int k=0; k<7; k++) {
        NUMS[7*i + k] = 15*i + wheel[k];
        total += NUMS[7*i + k];
    }
}
//now clean up for MAX not a multiple of 15
int j = n*7;
for(int i=n*15; i<MAX; i++) {
    if(i%5==0 || i%3==0) {
        NUMS[j++] = i;
        total += i;
    }
}
Edit: It's possible to do this without a critical section (which is what the ordered clause amounts to). This version does the memcpy in parallel and also makes better use of memory, at least for the shared array:
int *NUMS;
int *prefix;
int total=0, j;
#pragma omp parallel
{
    int i;
    int nthreads = omp_get_num_threads();
    int ithread  = omp_get_thread_num();
    #pragma omp single
    {
        prefix = malloc(sizeof *prefix * (nthreads+1));
        prefix[0] = 0;
    }
    int k = 0;
    unsigned int NUMS_local[MAX] = {0};
    #pragma omp for schedule(static) nowait reduction(+:total)
    for(i=0; i<MAX; i++) {
        if(i%5==0 || i%3==0) {
            NUMS_local[k++] = i;
            total += i;
        }
    }
    prefix[ithread+1] = k;
    #pragma omp barrier
    #pragma omp single
    {
        for(i=1; i<nthreads+1; i++) prefix[i] += prefix[i-1];
        NUMS = malloc(sizeof *NUMS * prefix[nthreads]);
        j = prefix[nthreads];
    }
    memcpy(&NUMS[prefix[ithread]], NUMS_local, sizeof *NUMS * k);
}
free(prefix);
This is a typical thread-synchronization issue. All you need to do is use a kernel synchronization object to make the desired operation atomic (incrementing the value of the variable j in your case). That would be a mutex, a semaphore, or an event object, depending on the operating system you're working on. But whatever your development environment is, to provide atomicity the fundamental flow should look like the following pseudo-code:
{
    lock(kernel_object)
    // ...
    // do your critical operation (increment your variable j in your case)
    // ++j;
    // ...
    unlock(kernel_object)
}
If you're working on the Windows operating system, there are some synchronization mechanisms provided by the environment (e.g. InterlockedIncrement or CreateCriticalSection, etc.). If you're working on a Unix/Linux based operating system, you can use mutex or semaphore kernel synchronization objects. All of these synchronization mechanisms stem from the concept of semaphores, invented by Edsger W. Dijkstra at the beginning of the 1960s.
Here are some basic examples:
Linux
#include <pthread.h>

pthread_mutex_t g_mutexObject = PTHREAD_MUTEX_INITIALIZER;

int main(int argc, char* argv[])
{
    // ...
    pthread_mutex_lock(&g_mutexObject);
    ++j; // incrementing j atomically
    pthread_mutex_unlock(&g_mutexObject);
    // ...
    pthread_mutex_destroy(&g_mutexObject);
    // ...
    exit(EXIT_SUCCESS);
}
Windows
#include <Windows.h>

CRITICAL_SECTION g_csObject;

int main(void)
{
    // ...
    InitializeCriticalSection(&g_csObject);
    // ...
    EnterCriticalSection(&g_csObject);
    ++j; // incrementing j atomically
    LeaveCriticalSection(&g_csObject);
    // ...
    DeleteCriticalSection(&g_csObject);
    // ...
    exit(EXIT_SUCCESS);
}
or just simply:
#include <Windows.h>

LONG volatile g_j; // our little j must be volatile in here now

int main(void)
{
    // ...
    InterlockedIncrement(&g_j); // incrementing j atomically
    // ...
    exit(EXIT_SUCCESS);
}
The problem you have is that the threads don't necessarily execute in order, so the last thread to write may not have read the value in order, and you end up overwriting data.
OpenMP has an option to make the threads in a loop perform a summation when they finish: the reduction clause. You have to write something like this to use it:
#pragma omp parallel for reduction(+:sum)
for(k=0; k<num; k++)
{
    sum = sum + A[k]*B[k];
}
/* End of computation */
gettimeofday(&fin, NULL);
All you have to do is accumulate the result in "sum"; this is from some old code of mine that computes a summation.
The other option is the dirty one: somehow make the threads wait and fall in order by using a call into the OS. This is easier than it looks. Something like this would be a solution:
#pragma omp parallel
for(i = ID + 1; i < MAX; i += NUM_THREADS)
{
    printf("asdasdasdasdasdasdasdas");
    if( i % 5 == 0 || i % 3 == 0)
    {
        NUMS[j++] = i; //Store multiples of 3 and 5 in an array to sum up later
    }
}
But I recommend that you read up on the OpenMP options fully.

Why is my parallel code slower than sequential code?

I have implemented parallel code in C for merge sort using OpenMP. I get a runtime of 3.9 seconds, which is slower than the sequential version of the same code (for which I get 3.6 seconds). I am trying to optimise the code as much as possible but can't increase the speedup. Can you please help out with this? Thanks.
void partition(int arr[], int arr1[], int low, int high, int thread_count)
{
    int tid, mid;
    #pragma omp if
    if(low < high)
    {
        if(thread_count == 1)
        {
            mid = (low+high)/2;
            partition(arr, arr1, low, mid, thread_count);
            partition(arr, arr1, mid+1, high, thread_count);
            sort(arr, arr1, low, mid, high);
        }
        else
        {
            #pragma omp parallel num_threads(thread_count)
            {
                mid = (low+high)/2;
                #pragma omp parallel sections
                {
                    #pragma omp section
                    {
                        partition(arr, arr1, low, mid, thread_count/2);
                    }
                    #pragma omp section
                    {
                        partition(arr, arr1, mid+1, high, thread_count/2);
                    }
                }
            }
            sort(arr, arr1, low, mid, high);
        }
    }
}
As was correctly noted, there are several mistakes in your code that prevent its correct execution, so I would first suggest reviewing those errors.
Anyhow, taking into account only how OpenMP performance scales with the number of threads, an implementation based on task directives would probably fit better, as it overcomes the limits already pointed out by a previous answer:
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause
You can find a sketch of such an implementation below:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>

void getTime(double *t) {
    struct timeval tv;
    gettimeofday(&tv, 0);
    *t = tv.tv_sec + (tv.tv_usec * 1e-6);
}

int compare( const void * pa, const void * pb ) {
    const int a = *((const int*) pa);
    const int b = *((const int*) pb);
    return (a-b);
}

void merge(int * array, int * workspace, int low, int mid, int high) {
    int i = low;
    int j = mid + 1;
    int l = low;
    while( (l <= mid) && (j <= high) ) {
        if( array[l] <= array[j] ) {
            workspace[i] = array[l];
            l++;
        } else {
            workspace[i] = array[j];
            j++;
        }
        i++;
    }
    if (l > mid) {
        for(int k=j; k <= high; k++) {
            workspace[i] = array[k];
            i++;
        }
    } else {
        for(int k=l; k <= mid; k++) {
            workspace[i] = array[k];
            i++;
        }
    }
    for(int k=low; k <= high; k++) {
        array[k] = workspace[k];
    }
}

void mergesort_impl(int array[], int workspace[], int low, int high) {
    const int threshold = 1000000;
    if( high - low > threshold ) {
        int mid = (low+high)/2;
        /* Recursively sort on halves */
#ifdef _OPENMP
        #pragma omp task
#endif
        mergesort_impl(array, workspace, low, mid);
#ifdef _OPENMP
        #pragma omp task
#endif
        mergesort_impl(array, workspace, mid+1, high);
#ifdef _OPENMP
        #pragma omp taskwait
#endif
        /* Merge the two sorted halves */
#ifdef _OPENMP
        #pragma omp task
#endif
        merge(array, workspace, low, mid, high);
#ifdef _OPENMP
        #pragma omp taskwait
#endif
    } else if (high - low > 0) {
        /* Coarsen the base case */
        qsort(&array[low], high-low+1, sizeof(int), compare);
    }
}

void mergesort(int array[], int workspace[], int low, int high) {
#ifdef _OPENMP
    #pragma omp parallel
#endif
    {
#ifdef _OPENMP
        #pragma omp single nowait
#endif
        mergesort_impl(array, workspace, low, high);
    }
}

const size_t largest = 100000000;
const size_t length = 10000000;

int main(int argc, char *argv[]) {
    int * array = NULL;
    int * workspace = NULL;
    double start, end;
    printf("Largest random number generated: %d \n", RAND_MAX);
    printf("Largest random number after truncation: %zu \n", largest);
    printf("Array size: %zu \n", length);
    /* Allocate and initialize random vector */
    array = (int*) malloc(length*sizeof(int));
    workspace = (int*) malloc(length*sizeof(int));
    for( int ii = 0; ii < length; ii++)
        array[ii] = rand()%largest;
    /* Sort */
    getTime(&start);
    mergesort(array, workspace, 0, length-1);
    getTime(&end);
    printf("Elapsed time sorting: %g sec.\n", end-start);
    /* Check result */
    for( int ii = 1; ii < length; ii++) {
        if( array[ii] < array[ii-1] ) printf("Error:\n%d %d\n%d %d\n", ii-1, array[ii-1], ii, array[ii]);
    }
    free(array);
    free(workspace);
    return 0;
}
Notice that if you seek performance you also have to guarantee that the base case of your recursion is coarse enough to avoid substantial overhead from recursive function calls. Other than that, I would suggest profiling your code so you get a good hint of which parts are really worth optimizing.
It took some figuring out, which is a bit embarrassing, since when you see it the answer is so simple.
As it stands in the question, the program doesn't work correctly; instead, it randomly duplicates some numbers and loses others on some runs. This appears to be a purely parallel error that doesn't arise when running the program with thread_count == 1.
The pragma "parallel sections" is a combined parallel and sections directive, which in this case means that it starts a second parallel region inside the previous one. Parallel regions inside other parallel regions are fine, but I think most implementations don't give you extra threads when they encounter a nested parallel region.
The fix is to replace
#pragma omp parallel sections
with
#pragma omp sections
After this fix, the program starts to give correct answers, and on a two-core system with a million numbers I get the following timings.
One thread:
time taken: 0.378794
Two threads:
time taken: 0.203178
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause, so change num_threads(thread_count) -> num_threads(2)
But because at least the two implementations I tried are not able to spawn new threads for nested parallel regions, the program as it stands doesn't scale to more than two threads.
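If you want to verify that claim on your own system, a small test program like the following (my own sketch, not part of the question) prints how many threads each nesting level actually gets. Note that omp_set_nested is deprecated as of OpenMP 5.0 in favour of omp_set_max_active_levels; calling both keeps the sketch working on older and newer runtimes.
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_nested(1);               /* enable nesting on older runtimes (deprecated in 5.0) */
    omp_set_max_active_levels(2);    /* allow two active levels of parallelism */
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp critical
            printf("level %d: thread %d of %d\n",
                   omp_get_level(), omp_get_thread_num(), omp_get_num_threads());
        }
    }
    return 0;
}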
