OpenMP: Access violation and other errors - c

Preface
Recently, I added OpenMP to our group's project code. main runs two nested for loops; the outer controls the 'run', while the inner controls the 'generation'. Generations from different runs are completely independent, but each generation depends on the earlier generations of the same run.
The idea is to parallelize the outer 'run' loop, letting each thread carry the evolution of generations forward for whichever run number it was assigned.
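To make the intended structure concrete, here is a minimal sketch (the identifiers are illustrative, not the actual project code):
/* Sketch: only the outer 'run' loop is parallelized; generations stay
   sequential within a run because each depends on the previous ones. */
#pragma omp parallel for schedule(static)
for (int run = 0; run < num_runs; run++)
{
    for (int gen = 0; gen < num_gens; gen++)
    {
        /* evolve generation 'gen' of run 'run' */
    }
}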
The Problem
When setting OMP_THREADS = 1, i.e. letting the program run with only one thread, it runs without a hitch. With any higher value, I get the following error:
Unhandled exception at 0x00F5C4C3 in projectc.exe: 0xC0000005: Access violation writing location 0x00000072.
with the following appearing in the "Autos" section of Visual Studio:
(Note: t, t->active_cells, and t->cellx are "error red" while the rest are white when I get this error)
If I change default(none) to default(shared) in the #pragma right above the outer loop and remove t, s, and bn from threadprivate (these are structures initialized in external files), then the program runs normally for one generation on each thread before freezing (though CPU activity shows both threads still running with the same intensity as before).
Attempts at Solutions
I cannot figure out what is going wrong. A plain #pragma omp parallel for in front of the outer loop of course doesn't work, but I have also tried declaring all of main as #pragma omp parallel and the outer loop as #pragma omp for, along with a few other subtle variations. This leads me to the conclusion that it must be something to do with the way the variables are shared between threads. Because all runs, and therefore threads, are independent, really all of the variables could be private; there is only the small overlap you see reflected in shared(...).
The code is attached below.
main.c
/* General Includes */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <omp.h>
/* Project Includes */
#include "main.h"
#include "randgen.h"
#include "board7.h"
#include "tissue.h"
#include "io.h"
#define BitFlp(arg,posn) ((arg) ^ (1L << (posn)))
#define BitClr(arg,posn) ((arg) & ~(1L << (posn)))
#define display_dbg 1 //Controls whether print statements in main.c are displayed.
#define display_time 1 //Controls whether timing print statements are executed.
#define BILLION 1000000000L;
#define num_runs 10 //Controls number of runs per simulation
#define num_gens 4000//Controls number of generations per run
#define OMP_THREADS 1 // Max number of threads used if OpenMP is enabled
int n, i, r, j, z, x, sxa, y, flagb, m;
int j1, j2;
char a;
int max_fit_gen, collect_data, lb_run, w, rn, sx;
float f, max_fitness;
tissuen *fx;
input_vec dx;
calookup ra;
#pragma omp threadprivate(n, r, j, x, z, sxa, y, flagb, m, \
j1, j2, a, max_fit_gen, collect_data, lb_run, w, \
rn, sx, f, max_fitness, fx, dx, ra, run_data, t, s, bn)
int main(int argc, char *argv[])
{
int* p = 0x00000000; // pointer to NULL
char sa[256];
char ss[10];
long randn;
boardtable ba;
srand((unsigned)time(NULL));
init_mm();
randn = number_range(1, 100);
#ifdef OS_WINDOWS
// Timing parameters
LARGE_INTEGER clk_freq;
LARGE_INTEGER t1, t2, t3;
#endif
#ifdef OS_UNIX
struct timespec clk_freq, t1, t2, t3;
#endif
double avg_gen_time, avg_run_time, run_time, sim_time, est_run_time, est_sim_time;
// File System and IO Parameters
char cwd[FILENAME_MAX];
getcwd(&cwd, sizeof(cwd));
char curState[FILENAME_MAX];
char recState[FILENAME_MAX];
char recMode[FILENAME_MAX];
char curGen[FILENAME_MAX];
char curRun[FILENAME_MAX];
char genTmp[FILENAME_MAX];
strcpy(curState, cwd);
strcpy(recState, cwd);
strcpy(recMode, cwd);
strcpy(curGen, cwd);
strcpy(curRun, cwd);
strcpy(genTmp, cwd);
#ifdef OS_WINDOWS
strcat(curState, "\\current.txt");
strcat(recState, "\\recover.txt");
strcat(recMode, "\\recovermode.txt");
strcat(curGen, "\\gen.txt");
strcat(curRun, "\\run");
strcat(genTmp, "\\tmp\\gentmp");
#endif
#ifdef OS_UNIX
strcat(curState, "/current.txt");
strcat(recState, "/recover.txt");
strcat(recMode, "/recovermode.txt");
strcat(curGen, "/gen.txt");
strcat(curRun, "/run");
strcat(genTmp, "/tmp/gentmp");
#endif
//Read current EA run variables (i.e. current run number, generation, recover mode status)
z = readorcreate(curState);
x = readorcreate(recState);
sxa = readorcreate(recMode);
y = readorcreate(curGen);
//Initialize simulation parameters
s.count = 0;
s.x[0] = 0;
s.y[0] = 0;
s.addvec[0] = 0;
s.bestnum = 0;
s.countb = 0;
s.count = 0;
initialize_sim_param(&s, 0, 200);
collect_data = 0;
//Build a collection of experiment initial conditions
buildboardcollection7(&bn);
//Determine clock frequency.
#ifdef OS_WINDOWS
if (display_time) get_frequency(&clk_freq);
#endif
#ifdef OS_UNIX
if (display_time) get_frequency(CLOCK_REALTIME, &clk_freq);
#endif
//Start simulation timer
#ifdef OS_WINDOWS
if (display_time) read_clock(&t1);
#endif
#ifdef OS_UNIX
if (display_time) read_clock(CLOCK_REALTIME, &t1);
#endif
#pragma omp parallel for schedule(static) default(none) num_threads(OMP_THREADS) \
private(sa, ss, randn, ba, t2, t3, avg_gen_time, avg_run_time, sim_time, \
run_time, est_run_time, est_sim_time) \
shared(i, cwd, recMode, curRun, curGen, curState, genTmp, clk_freq, t1)
for (i = z; i < num_runs; i++)
{
// randomly initialize content of tissue population
initialize_tissue_pop_s2(&(t.tgen[0]), &s);
initialize_tissue_pop_s2(&(t.tgen[1]), &s);
max_fit_gen = 0;
max_fitness = 0.0;
flagb = 0;
if ((i == z) && (x == 1))
{
w = y;
}
else
{
w = 0;
}
rn = 200;
j1 = 0;
s.run_num = i;
s.maxfitness = 0.0;
//Start run timer
#ifdef OS_WINDOWS
if (display_time) read_clock(&t2);
#endif
#ifdef OS_UNIX
if (display_time) read_clock(CLOCK_REALTIME, &t2);
#endif
#if defined(_OPENMP)
printf("\n ======================================= \n");
printf(" OpenMP Status Message \n");
printf("\n --------------------------------------- \n");
printf("| RUN %d : \n", i);
printf("| New Thread Process (Thread %d) \n", omp_get_thread_num());
printf("| Available Threads: %d of %d \n", omp_get_num_threads(), omp_get_max_threads());
printf(" ======================================= \n\n");
#endif
for (j = w; j < num_gens; j++)
{
// Flips on lightboard data collection. See board7.h.
if (enable_collection == 1) {
if ((i >= run_collect) && (j >= gen_collect)) { collect_data = 1; }
}
sx = readcurrent(recMode);
// Pseudo loop code. Uses bit flipping to cycle through boards.
j2 = ~(j1)& 1;
if (display_dbg) printf("start evaluation...\n");
// evaluate tissue
// Most of the problems in the code happen here.
evaluatepopulation_tissueb(&(t.tgen[j1]), &ra, &bn, &s, j, i);
if (display_dbg) printf("\n");
// display fitness stats to screen
printmaxfitness(&(t.tgen[j1]), i, j, j1, &cwd);
if (display_dbg) printf("start tournament...\n");
// Perform tournament selection and have children ready for evaluation
// Rarely have to touch. Figure out best parents. Crossover operator.
// Create a subgroup. Randomly pick individuals from the population.
// Pick fittest individuals out of the random group.
// 2 parents and 2 children. Children replace parents.
tournamentsel_tissueb(&(t.tgen[j1]), &(t.tgen[j2]), &s);
printf("Tournament selection complete.\n");
// keep track of best fitness during run
if (t.tgen[j1].fit_max > max_fitness)
{
max_fitness = t.tgen[j1].fit_max;
max_fit_gen = j;
}
if ((t.tgen[j1].fit_max > 99.0) && (flagb == 0))
{
flagb = 1;
run_data.fit90[i] = t.tgen[j1].fit_max;
run_data.gen90[i] = j;
}
sa[0] = 0;
strcat(sa, curRun);
sprintf(ss, "%d", i);
strcat(sa, ss);
strcat(sa, ".txt");
printf("Write fitness epc...\n");
// write fitness stats to file
writefitnessepc(sa, &(t), j1, j);
printf("Write fitness complete.\n");
// trunk for saving population to disk
if (sx != 0)
{
sa[0] = 0;
strcat(sa, genTmp);
sprintf(ss, "%d", 1);
strcat(sa, ss);
strcat(sa, ".txt");
if (display_dbg) printf("Saving Current Run\n");
}
//update current generation to file
writecurrent(curGen, j + 1);
if (display_time && j > 0 && (j % 10 == 0 || j % (num_gens - 1) == 0))
{
#ifdef OS_WINDOWS
read_clock(&t3);
sim_time = (t3.QuadPart - t1.QuadPart) / clk_freq.QuadPart;
run_time = (t3.QuadPart - t2.QuadPart) / clk_freq.QuadPart;
#endif
#ifdef OS_UNIX
read_clock(CLOCK_REALTIME, &t3);
sim_time = (double)(t3.tv_sec - t1.tv_sec);
run_time = (double)(t3.tv_sec - t2.tv_sec);
#endif
avg_gen_time = run_time / (j + 1);
est_run_time = avg_gen_time * (num_gens - j);
avg_run_time = est_run_time + run_time;
est_sim_time = (est_run_time * (num_runs - i)) / (i + 1);
printf("\n============= Timing Data =============\n");
printf("Time in Simulation: %.2fs\n", sim_time);
printf("Time in Run: %.2fs\n", run_time);
printf("Est. Time to Complete Run: %.2fs\n", est_run_time);
printf("Est. Time to Complete Simulation: %.2fs\n\n", est_sim_time);
printf("Average Time Per Generation: %.2fs/gen\n", avg_gen_time);
printf("Average Time Per Run: %.2fs/run\n", avg_run_time);
printf("=======================================\n\n");
if (j % (num_gens - 1) == 0) {
}
}
//Display Position Board
//displayboardl(&bn.board[0]);
j1 = j2;
}
}
}
Structures
typedef struct boardcollectionn
{
boardtable board[boardnumb];
} boardcollection;
boardcollection bn;
typedef struct tissue_gent
{
tissue_population tgen[2];
} tissue_genx;
typedef struct sim_paramt //struct for storing simulation parameters
{
int penalty;
int addnum[cell_numz];
int x[9];
int y[9];
uint8_t addvec[9];
uint8_t parenta[50];
uint8_t parentb[50];
int errorstatus;
int ones[outputnum][5000];
int zeros[outputnum][5000];
int probcount;
int num;
int numb;
int numc;
int numd;
int nume;
int numf;
int bestnum;
int count;
int col_flag;
int behaviour[outputnum];
int memm[4];
int sel;
int seldecnum;
int seldec[200];
int selx[200];
int sely[200];
int selz[200];
int countb;
float maxfitness;
float oldmaxfitness;
int run_num;
int collision;
} sim_param;
tissue_genx t;
sim_param s;

The code is too big for proper testing, and the use of global variables really doesn't help in figuring out the data dependencies. However, I can make a few remarks:
i is declared shared, yet it is the index of the parallelised loop. This is wrong! If there is one variable you really want to be private in an omp for loop, it is the loop index. I didn't find anything clear about this in the OpenMP standard for C and C++, whereas for Fortran the loop index (and the indices of all enclosed loops) is implicitly privatised. Nonetheless, the Intel compiler gives an error when you try to explicitly declare such an index shared:
sharedi.cc(11): warning #2555: static control variable for parallel loop
for ( i=0; i<10; i++ ) {
^
sharedi.cc(10): error: index variable "i" of for statement following an OpenMP for pragma must be private
#pragma omp parallel for shared(i) schedule(static)
^
compilation aborted for sharedi.cc (code 2)
Meanwhile, gcc 5.1.0 doesn't emit any warning or error for the same code and acts as if the variable had been declared private... I tend to find the Intel compiler's behaviour more reasonable, but I'm not 100% sure which one is correct. What I do know is that declaring i shared is definitely a very, very bad idea (and even a bug AFAIC). So this feels like a grey area where your compiler may or may not do a sensible job, which could all by itself explain most of your problems.
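As a minimal sketch (reusing the clauses from the question), the loop index can be made private simply by declaring it in the for statement and dropping it from shared():
/* Sketch only: same clauses as in the question, but the index is declared
   in the for statement, so it is private by construction. */
#pragma omp parallel for schedule(static) default(none) num_threads(OMP_THREADS) \
private(sa, ss, randn, ba, t2, t3, avg_gen_time, avg_run_time, sim_time, \
run_time, est_run_time, est_sim_time) \
shared(cwd, recMode, curRun, curGen, curState, genTmp, clk_freq, t1)
for (int i = z; i < num_runs; i++)
{
    /* ... loop body unchanged ... */
}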
You seem to output your data into files whose names might conflict across threads. Be careful with that, as you might end up with a big mess...
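One way to make the naming collision-proof is to build every output file name from the run number in a single helper; a sketch with illustrative names (assumes <stdio.h>):
/* Sketch: build a per-run file name so no two threads write to the same file. */
static void make_run_filename(char *out, size_t outsz, const char *base, int run)
{
    snprintf(out, outsz, "%s%d.txt", base, run);   /* e.g. ".../run3.txt" */
}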
Your printing is very likely to be all messed up. I don't know how much importance you attach to that, but it won't be pretty the way it is written now.
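If readable console output matters, one common workaround is to serialize whole message blocks with a named critical section; a sketch:
/* Sketch: print a multi-line status block atomically so output from
   different threads is not interleaved. */
#pragma omp critical(console_output)
{
    printf("=======================================\n");
    printf("| RUN %d : thread %d of %d\n",
           i, omp_get_thread_num(), omp_get_num_threads());
    printf("=======================================\n");
}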
In summary, your code is just too tangled for me to get a clear view of what's happening. Try to address at least the first two points I mentioned; that might be enough to get it to "work". However, I cannot encourage you enough to clean the code up and get rid of your global variables. Likewise, try to declare your variables as late in the source as possible, since this reduces the need to declare them private for OpenMP and greatly improves readability.
Good luck with your debugging.

Related

Using openmp to distribute matrix multiplication work across multiple GPUs via openacc using C

I am trying to distribute the work of multiplying two NxN matrices across 3 NVIDIA GPUs using 3 OpenMP threads. (The matrix values will get large, hence the long long data type.) However, I am having trouble placing the #pragma acc parallel loop in the correct place. I have used some examples from the NVIDIA PDFs, but with no luck. I know that the innermost loop cannot be parallelized. But I would like each of the three threads to own a GPU and do a portion of the work. Note that the input and output matrices are defined as global variables, as I kept running out of stack memory.
I have tried the code below, but I get compilation errors, all pointing to line 75, which is the #pragma acc parallel loop line:
[test#server ~]pgcc -acc -mp -ta=tesla:cc60 -Minfo=all -o testGPU matrixMultiplyopenmp.c
PGC-S-0035-Syntax error: Recovery attempted by replacing keyword for by keyword barrier (matrixMultiplyopenmp.c: 75)
PGC-S-0035-Syntax error: Recovery attempted by replacing acc by keyword enum (matrixMultiplyopenmp.c: 76)
PGC-S-0036-Syntax error: Recovery attempted by inserting ';' before keyword for (matrixMultiplyopenmp.c: 77)
PGC/x86-64 Linux 18.10-1: compilation completed with severe errors
Function is:
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
// Get Nvidia device type
acc_init(acc_device_nvidia);
// Get Number of GPUs in system
int num_gpus = acc_get_num_devices(acc_device_nvidia);
//Set the number of OpenMP thread to the number of GPUs
#pragma omp parallel num_threads(num_gpus)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
acc_set_device_num(threadNum, acc_device_nvidia);
int row;
int col;
int key;
#pragma omp for
#pragma acc parallel loop
for (row = 0; row < SIZE; row++)
for (col = 0; col < SIZE; col++)
for (key = 0; key < SIZE; key++)
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}
}
As fisehara points out, you can't combine an OpenMP for loop with an OpenACC parallel loop on the same for loop. Instead, you need to manually decompose the work across the OpenMP threads. Example below.
Is there a reason why you want to use multiple GPUs here? Most likely the matrix multiply will fit onto a single GPU, so there's no need for the extra overhead of introducing host-side parallelization.
Also, I generally recommend using MPI+OpenACC for multi-GPU programming. Domain decomposition is naturally part of MPI but not inherent in OpenMP. MPI also gives you a one-to-one relationship between the host process and the accelerator, allows scaling beyond a single node, and lets you take advantage of CUDA-aware MPI for direct GPU-to-GPU data transfers. For more info, do a web search for "MPI OpenACC" and you'll find several tutorials. Class #2 at https://developer.nvidia.com/openacc-advanced-course is a good resource.
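For reference, a minimal sketch of the MPI+OpenACC pattern (one rank per GPU); the row decomposition itself would mirror the OpenMP version below:
/* Sketch: bind each MPI rank to one GPU. Assumes MPI and OpenACC runtimes. */
#include <mpi.h>
#include <openacc.h>

int main(int argc, char *argv[])
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia); /* one rank per GPU */
    /* each rank computes its own block of rows, then the blocks are
       gathered on rank 0 (e.g. with MPI_Gather) */
    MPI_Finalize();
    return 0;
}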
% cat test.c
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#ifdef _OPENACC
#include <openacc.h>
#endif
#define SIZE 130
void multiplyMatrix(long long int matrixA[SIZE][SIZE], long long int matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE])
{
#ifdef _OPENACC
// Get Nvidia device type
acc_init(acc_device_nvidia);
// Get Number of GPUs in system
int num_gpus = acc_get_num_devices(acc_device_nvidia);
#else
int num_gpus = omp_get_max_threads();
#endif
if (SIZE<num_gpus) {
num_gpus=SIZE;
}
printf("Num Threads: %d\n",num_gpus);
//Set the number of OpenMP thread to the number of GPUs
#pragma omp parallel num_threads(num_gpus)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
#ifdef _OPENACC
acc_set_device_num(threadNum, acc_device_nvidia);
printf("THID %d using GPU: %d\n",threadNum,threadNum);
#endif
int row;
int col;
int key;
int start, end;
int block_size;
block_size = SIZE/num_gpus;
start = threadNum*block_size;
end = start+block_size;
if (threadNum==(num_gpus-1)) {
// add the residual to the last thread
end = SIZE;
}
printf("THID: %d, Start: %d End: %d\n",threadNum,start,end-1);
#pragma acc parallel loop \
copy(matrixProduct[start:end-start][:SIZE]), \
copyin(matrixA[start:end-start][:SIZE],matrixB[:SIZE][:SIZE])
for (row = start; row < end; row++) {
#pragma acc loop vector
for (col = 0; col < SIZE; col++) {
for (key = 0; key < SIZE; key++) {
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}}}
}
}
int main() {
long long int matrixA[SIZE][SIZE];
long long int matrixB[SIZE][SIZE];
long long int matrixProduct[SIZE][SIZE];
int i,j;
for(i=0;i<SIZE;++i) {
for(j=0;j<SIZE;++j) {
matrixA[i][j] = (i*SIZE)+j;
matrixB[i][j] = (j*SIZE)+i;
matrixProduct[i][j]=0;
}
}
multiplyMatrix(matrixA,matrixB,matrixProduct);
printf("Result:\n");
for(i=0;i<SIZE;++i) {
printf("%d: %ld %ld\n",i,matrixProduct[i][0],matrixProduct[i][SIZE-1]);
}
}
% pgcc test.c -mp -ta=tesla -Minfo=accel,mp
multiplyMatrix:
28, Parallel region activated
49, Generating copyin(matrixB[:130][:])
Generating copy(matrixProduct[start:end-start][:131])
Generating copyin(matrixA[start:end-start][:131])
Generating Tesla code
52, #pragma acc loop gang /* blockIdx.x */
54, #pragma acc loop vector(128) /* threadIdx.x */
55, #pragma acc loop seq
54, Loop is parallelizable
55, Complex loop carried dependence of matrixA->,matrixProduct->,matrixB-> prevents parallelization
Loop carried dependence of matrixProduct-> prevents parallelization
Loop carried backward dependence of matrixProduct-> prevents vectorization
59, Parallel region terminated
% a.out
Num Threads: 4
THID 0 using GPU: 0
THID: 0, Start: 0 End: 31
THID 1 using GPU: 1
THID: 1, Start: 32 End: 63
THID 3 using GPU: 3
THID: 3, Start: 96 End: 129
THID 2 using GPU: 2
THID: 2, Start: 64 End: 95
Result:
0: 723905 141340355
1: 1813955 425843405
2: 2904005 710346455
3: 3994055 994849505
...
126: 138070205 35988724655
127: 139160255 36273227705
128: 140250305 36557730755
129: 141340355 36842233805
I ran into an issue with MPI+OpenACC compilation on the shared system I was restricted to, and I could not upgrade the compiler. The solution I ended up using was to break the work down with OpenMP first and then call an OpenACC function, as follows:
//Main code
#pragma omp parallel num_threads(num_gpus)
{
#pragma omp for private(tid)
for (tid = 0; tid < num_gpus; tid++)
{
//Get thread openMP number and set the GPU device to that number
int threadNum = omp_get_thread_num();
acc_set_device_num(threadNum, acc_device_nvidia);
// check with thread is using which GPU
int gpu_num = acc_get_device_num(acc_device_nvidia);
printf("Thread # %d is going to use GPU # %d \n", threadNum, gpu_num);
//distribute the uneven rows
if (threadNum < extraRows)
{
startRow = threadNum * (rowsPerThread + 1);
stopRow = startRow + rowsPerThread;
}
else
{
startRow = threadNum * rowsPerThread + extraRows;
stopRow = startRow + (rowsPerThread - 1);
}
// Debug to check allocation of data to threads
//printf("Start row is %d, and Stop rows is %d \n", startRow, stopRow);
GPUmultiplyMatrix(matrixA, matrixB, matrixProduct, startRow, stopRow);
}
}
void GPUmultiplyMatrix(long long int matrixA[SIZE][SIZE], long long int
matrixB[SIZE][SIZE], long long int matrixProduct[SIZE][SIZE], int
startRow, int stopRow)
{
int row;
int col;
int key;
#pragma acc parallel loop collapse (2)
for (row = startRow; row <= stopRow; row++)
for (col = 0; col < SIZE; col++)
for (key = 0; key < SIZE; key++)
matrixProduct[row][col] = matrixProduct[row][col] + (matrixA[row][key] * matrixB[key][col]);
}

How to return each thread's output into an array using OpenMP?

I would like to write a multi-threaded program where each thread outputs an array with an unknown number of elements.
For example, select all numbers < 10 from an int array and put them into a new array.
Pseudo code (8 threads):
int *hugeList = malloc(10000000);
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = (rand() % 100);//random integers from 0 to 99
}
long *subList[8];//to fill each thread's result
#pragma omp parallel
for (long i = 0; i < 1000000; ++i)
{
long n = 0;
if(hugeList[i] < 10)
{
//do something to fill "subList" properly
subList[threadNo][n] = hugeList[i];
n++;
}
}
Array "subList" should collect the elements in "hugeList" which satisfies condition (<10) ,sequentially and in terms of thread number.
How should I write the code? It is OK if there is a better way using OpenMP.
There are several problems in your code.
1/ The omp pragma should be parallel for if you want the for loop to be parallelized. Otherwise, the code will be duplicated in every thread.
2/ The code is inconsistent with its comment
//do something to fill "subList" properly
hugeList[i] = subList[threadNo][n];
3/ How do you know the number of elements in your sublists? It must be returned to the main thread. You could use an array, but beware of false sharing. Better to use a local variable and write it out at the end of the parallel section.
4/ subList is not allocated. The difficulty is that you do not know the number of threads. You can ask OpenMP for the maximum number of threads (omp_get_max_threads()) and do a dynamic allocation. If you want some static allocation, the best approach is probably to allocate one large table and compute the actual start address in every thread.
5/ OpenMP code must also work without an OpenMP compiler. Use #ifdef _OPENMP for that.
Here is an (untested) way your code can be written
#define HUGE 10000000
int *hugeList = (int *) malloc(HUGE*sizeof(int)); // sized in elements, not bytes
#ifdef _OPENMP
int thread_nbr=omp_get_max_threads();
#else
int thread_nbr=1; // to ensure proper behavior in a sequential context
#endif
struct thread_results { // to hold per thread results
int nbr; // nbr of generated results
int *results; // actual filtered numbers. Will write in subList table
};
// could be parallelized, but rand is not thread safe. drand48 should be
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = (rand() % 100);//random integers from 0 to 99
}
int *subList=(int *)malloc(HUGE*sizeof(int)); // table to hold thread results
// this is more complex to have a 2D array here as max_thread and actual number of thread
// are not known at compile time. VLA cannot be used (and array dim can be very large).
// Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
// dimensionned accordingly to avoid bugs.
struct thread_results* threadres=(struct thread_results *)malloc(thread_nbr*sizeof(struct thread_results));
#pragma omp parallel
{
// first declare and initialize thread vars
#ifdef _OPENMP
int thread_id = omp_get_thread_num() ; // hold thread id
int thread_nbr = omp_get_num_threads() ; // hold actual nbr of threads
#else
// to ensure proper serial behavior
int thread_id = 0;
int thread_nbr = 1;
#endif
struct thread_results *res=threadres+thread_id;
res->nbr=0;
// compute address in subList table
res->results=subList+(HUGE/thread_nbr)*thread_id;
int * res_ptr=res->results; // local pointer. Each thread points to independent part of subList table
int n=0; // number of results. We want one per thread to only have local updates.
#pragma omp for
for (long i = 0; i < 1000000; ++i)
{
if(hugeList[i] < 10)
{
//do something to fill "subList" properly
res_ptr[n]=hugeList[i];
n++;
}
}
res->nbr=n;
}
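A possible follow-up step (not part of the original answer, just a sketch): after the parallel section, the per-thread counts in threadres can be used to pack the scattered chunks of subList into one contiguous block (needs <string.h>):
/* Sketch: compact the per-thread chunks into a contiguous prefix of subList. */
int total = 0;
for (int t = 0; t < thread_nbr; t++) {
    memmove(subList + total, threadres[t].results,
            threadres[t].nbr * sizeof(int)); /* regions may overlap, hence memmove */
    total += threadres[t].nbr;
}
/* subList[0..total-1] now holds all selected values, grouped by thread */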
Updated complete code, based on @Alain Merigot's answer
I tested the following code; it is reproducible (with and without the #pragma arguments).
However, only the first elements of subList are correct, while the rest are empty.
(filename.c)
#include <stdio.h>
#include <time.h>
#include <omp.h>
#include <stdlib.h>
#include <math.h>
#define HUGE 10000000
#define DELAY 1000 //depends on your CPU power
//use global variables to store desired results, otherwise they can't be obtained outside the "pragma"
int n = 0;// number of results. We want one per thread to only have local updates.
double *subList;// table to hold thread results
int main()
{
double *hugeList = (double *)malloc(HUGE);
#ifdef _OPENMP
int thread_nbr = omp_get_max_threads();
#else
int thread_nbr = 1; // to ensure proper behavior in a sequential context
#endif
struct thread_results
{ // to hold per thread results
int nbr; // nbr of generated results
double *results; // actual filtered numbers. Will write in subList table
};
// could be parallelized, but rand is not thread safe. drand48 should be
for (long i = 0; i < 1000000; ++i)
{
hugeList[i] = sin(i); //fixed array content to test reproducibility
}
subList = (double *)malloc(HUGE * sizeof(double)); // table to hold thread results
// this is more complex to have a 2D array here as max_thread and actual number of thread
// are not known at compile time. VLA cannot be used (and array dim can be very large).
// Concerning its size, it is possible to have ALL elements in hugeList selected and the array must be
// dimensionned accordingly to avoid bugs.
struct thread_results *threadres = (struct thread_results *)malloc(thread_nbr * sizeof(struct thread_results));
#pragma omp parallel
{
// first declare and initialize thread vars
#ifdef _OPENMP
int thread_id = omp_get_thread_num(); // hold thread id
int thread_nbr = omp_get_num_threads(); // hold actual nbr of threads
#else
// to ensure proper serial behavior
int thread_id = 0;
int thread_nbr = 1;
#endif
struct thread_results *res = threadres + thread_id;
res->nbr = 0;
// compute address in subList table
res->results = subList + (HUGE / thread_nbr) * thread_id;
double *res_ptr = res->results; // local pointer. Each thread points to independent part of subList table
#pragma omp for reduction(+ : n)
for (long i = 0; i < 1000000; ++i)
{
for (int i = 0; i < DELAY; ++i){}//do nothing, just waste time
if (hugeList[i] < 0)
{
//do something to fill "subList" properly
res_ptr[n] = hugeList[i];
n++;
}
}
res->nbr = n;
}
for (int i = 0; i < 10; ++i)
{
printf("sublist %d: %lf\n", i, subList[i]);//show some elements of subList to check reproducibility
}
printf("n = %d\n", n);
}
Linux compile: gcc -o filename filename.c -fopenmp -lm
I hope there can be more discussion of the mechanism of this code.

What to heed, when reading an array from multiple threads?

I'd like to get to know OpenMP a bit, because I'd like to parallelize a huge loop. After some reading (SO, Common OMP mistakes, tutorial, etc.), I've taken as a first step the basically working C/MEX code given below (which yields different results for the first test case).
The first test sums up result values (functions serial and parallel);
the second takes values from an input array and writes the processed values to an output array (functions serial_a and parallel_a).
My questions are:
Why do the results of the first test differ, i.e. why do serial and parallel give different results?
Surprisingly, the second test succeeds. My concern is how to handle memory (array locations) that may be read by multiple threads. In the example this is emulated by sin(a[i]) / cos(a[n-i]).
Are there some easy rules for determining which variables to declare as private, shared and reduction?
In both cases int i is declared outside the pragma, yet the second test appears to yield correct results. So is that okay, or does i have to be moved into the omp parallel region, as is said here?
Any other hints on spotted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared where you use parallel. Private and shared variables should also be declared where you use parallel. But when you do a reduction you should not declare the reduced variable as shared; the reduction clause takes care of that.
To know what to declare private or shared you have to ask yourself which variables are being written to. If a variable is not written to, then normally you want it to be shared. In your case the variable x does not change, so you should declare it shared. The variable i, however, does change, so normally you should declare it private. To fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
#pragma omp for
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
However, OpenMP automatically makes the iterator of a parallel for region private, and variables declared outside parallel regions are shared by default, so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statement. OpenMP is designed so that you don't have to change your code except for pragma statements.
When it comes to arrays, as long as each iteration of a parallel for loop acts on a different array element, you don't have to worry about shared and private. So you can write your parallel_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful with assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
for(j=0; j<m; j++) {
//
}
}
If you use #pragma omp parallel for on that, the i iterator will be made private, but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i, and since j is shared by default it is not made private. In this case you would need to explicitly declare j private, like this: #pragma omp parallel for private(j).
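A minimal sketch of the two usual fixes:
/* (a) keep the declarations outside and explicitly privatize j */
#pragma omp parallel for private(j)
for(i=0; i<n; i++) {
    for(j=0; j<m; j++) {
        //
    }
}
/* (b) declare the indices in the loops, so both are private by scope */
#pragma omp parallel for
for(int i=0; i<n; i++) {
    for(int j=0; j<m; j++) {
        //
    }
}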

Differences in OpenMP performance with different versions of OS

I have a piece of code that I wrote a while ago. Its only purpose was to experiment with OpenMP. I recently switched from a MacBook Pro running Lion (early 2011) to a MacBook Pro running Mountain Lion (early 2013). If more hardware or other info would help, I'd be happy to provide it.
The code worked fine on the old machine, meaning 8 threads kept my processor at 100% (98% minimum) load. Now the identical code, recompiled on the new machine, reaches at most a 62% processor load, even if I raise the number of threads. Both processor loads were measured with iStat Pro.
My question is: what can cause this to happen?
EDIT: The problem seems to be solved if I delete the for in #pragma omp parallel for shared(largest_factor, largest), so that I get #pragma omp parallel shared(largest_factor, largest).
But I still don't understand why it works.
The code in question:
#include <stdio.h>
#include <omp.h>
double fib(double n);
int main()
{
int data[] = {124847,194747,194747,194747,194747,
194747,194747,194747,194747,194747,194747};
int largest, largest_factor = 0;
omp_set_num_threads(8);
/* "omp parallel for" turns the for loop multithreaded by making each thread
* iterating only a part of the loop variable, in this case i; variables declared
* as "shared" will be implicitly locked on access
*/
#pragma omp parallel for shared(largest_factor, largest)
for (int i = 0; i < 10; i++) {
int p, n = data[i];
for (p = 3; p * p <= n && n % p; p += 2);
printf("\n%f\n\n",fib(i+40));
if (p * p > n) p = n;
if (p > largest_factor) {
largest_factor = p;
largest = n;
printf("thread %d: found larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
else
{
printf("thread %d: not larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
}
printf("Largest factor: %d of %d\n", largest_factor, largest);
return 0;
}
double fib(double n)
{
if (n<=1)
{
return 1;
}
else
{
return fib(n-1)+fib(n-2);
}
}
The main reason you don't see all threads being used is that each thread takes a different amount of time (due to the recursive function and the inner loop) and you only have 10 iterations. The fast threads finish quickly, and then only a few slow threads are left running. When you first run your code it starts off at 100% and falls off as the fast threads finish and the last few slow threads are still running. If you change the iteration count to 100 (and enlarge the data array accordingly), you will see the CPU usage stay at 100% for much longer. I added some timing printouts to your code.
Also, I think you have a race condition on your shared variables, so I put in a critical section.
To answer your question about the code without the for: what that does is run the same code on eight different threads! Instead of each thread running a particular iteration, each thread runs all 10 iterations. That's going to be no faster than running on a single thread, and perhaps even slower.
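To make the difference concrete, a minimal sketch (loop body elided):
/* without "for": every thread executes the whole loop, so all 10
   iterations are done 8 times over */
#pragma omp parallel shared(largest_factor, largest)
for (int i = 0; i < 10; i++) { /* ... */ }

/* with "for": the 10 iterations are divided among the 8 threads */
#pragma omp parallel for shared(largest_factor, largest)
for (int i = 0; i < 10; i++) { /* ... */ }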
Lastly, since each iteration takes a different amount of time, in general you should use schedule(dynamic), like this:
#pragma omp parallel for shared(largest_factor, largest) schedule(dynamic)
However, since you only have 10 iterations I don't think it will make much difference in this case. Here is what I did to your code to understand what is going on:
#include <stdio.h>
#include <omp.h>
double fib(double n);
int main()
{
int data[] = {124847,194747,194747,194747,194747,
194747,194747,194747,194747,194747,194747};
int largest, largest_factor = 0;
omp_set_num_threads(8);
/* "omp parallel for" turns the for loop multithreaded by making each thread
* iterating only a part of the loop variable, in this case i; variables declared
* as "shared" will be implicitly locked on access
*/
#pragma omp parallel for shared(largest_factor, largest)
for (int i = 0; i < 10; i++) {
int p, n = data[i];
double time = omp_get_wtime();
for (p = 3; p * p <= n && n % p; p += 2);
printf("\n iteratnion %d, fib %f\n\n",i, fib(i+40));
time = omp_get_wtime() - time;
printf("time %f\n", time);
if (p * p > n) p = n;
#pragma omp critical
{
if (p > largest_factor) {
largest_factor = p;
largest = n;
printf("thread %d: found larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
else {
printf("thread %d: not larger: %d of %d\n",
omp_get_thread_num(), p, n);
}
}
}
printf("Largest factor: %d of %d\n", largest_factor, largest);
return 0;
}
double fib(double n) {
if (n<=1) {
return 1;
}
else {
return fib(n-1)+fib(n-2);
}
}

Why is my parallel code slower than sequential code?

I have implemented a parallel merge sort in C using OpenMP. The parallel version runs in 3.9 seconds, which is slower than the sequential version of the same code (3.6 seconds). I am trying to optimise the code as far as possible but can't increase the speedup. Can you please help me with this? Thanks.
void partition(int arr[],int arr1[],int low,int high,int thread_count)
{
int tid,mid;
#pragma omp if
if(low<high)
{
if(thread_count==1)
{
mid=(low+high)/2;
partition(arr,arr1,low,mid,thread_count);
partition(arr,arr1,mid+1,high,thread_count);
sort(arr,arr1,low,mid,high);
}
else
{
#pragma omp parallel num_threads(thread_count)
{
mid=(low+high)/2;
#pragma omp parallel sections
{
#pragma omp section
{
partition(arr,arr1,low,mid,thread_count/2);
}
#pragma omp section
{
partition(arr,arr1,mid+1,high,thread_count/2);
}
}
}
sort(arr,arr1,low,mid,high);
}
}
}
As was correctly noted, there are several mistakes in your code that prevent its correct execution, so I would first suggest reviewing those errors.
Anyhow, taking into account only how OpenMP performance scales with the number of threads, an implementation based on task directives may fit better, as it overcomes the limit already pointed out in a previous answer:
Since the sections directive only has two sections, I think you won't get any benefit from spawning more than two threads in the parallel clause
You can find such an implementation below:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
void getTime(double *t) {
struct timeval tv;
gettimeofday(&tv, 0);
*t = tv.tv_sec + (tv.tv_usec * 1e-6);
}
int compare( const void * pa, const void * pb ) {
const int a = *((const int*) pa);
const int b = *((const int*) pb);
return (a-b);
}
void merge(int * array, int * workspace, int low, int mid, int high) {
int i = low;
int j = mid + 1;
int l = low;
while( (l <= mid) && (j <= high) ) {
if( array[l] <= array[j] ) {
workspace[i] = array[l];
l++;
} else {
workspace[i] = array[j];
j++;
}
i++;
}
if (l > mid) {
for(int k=j; k <= high; k++) {
workspace[i]=array[k];
i++;
}
} else {
for(int k=l; k <= mid; k++) {
workspace[i]=array[k];
i++;
}
}
for(int k=low; k <= high; k++) {
array[k] = workspace[k];
}
}
void mergesort_impl(int array[],int workspace[],int low,int high) {
const int threshold = 1000000;
if( high - low > threshold ) {
int mid = (low+high)/2;
/* Recursively sort on halves */
#ifdef _OPENMP
#pragma omp task
#endif
mergesort_impl(array,workspace,low,mid);
#ifdef _OPENMP
#pragma omp task
#endif
mergesort_impl(array,workspace,mid+1,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
/* Merge the two sorted halves */
#ifdef _OPENMP
#pragma omp task
#endif
merge(array,workspace,low,mid,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
} else if (high - low > 0) {
/* Coarsen the base case */
qsort(&array[low],high-low+1,sizeof(int),compare);
}
}
void mergesort(int array[],int workspace[],int low,int high) {
#ifdef _OPENMP
#pragma omp parallel
#endif
{
#ifdef _OPENMP
#pragma omp single nowait
#endif
mergesort_impl(array,workspace,low,high);
}
}
const size_t largest = 100000000;
const size_t length = 10000000;
int main(int argc, char *argv[]) {
int * array = NULL;
int * workspace = NULL;
double start,end;
printf("Largest random number generated: %d \n",RAND_MAX);
printf("Largest random number after truncation: %d \n",largest);
printf("Array size: %d \n",length);
/* Allocate and initialize random vector */
array = (int*) malloc(length*sizeof(int));
workspace = (int*) malloc(length*sizeof(int));
for( int ii = 0; ii < length; ii++)
array[ii] = rand()%largest;
/* Sort */
getTime(&start);
mergesort(array,workspace,0,length-1);
getTime(&end);
printf("Elapsed time sorting: %g sec.\n", end-start);
/* Check result */
for( int ii = 1; ii < length; ii++) {
if( array[ii] < array[ii-1] ) printf("Error:\n%d %d\n%d %d\n",ii-1,array[ii-1],ii,array[ii]);
}
free(array);
free(workspace);
return 0;
}
Notice that if you seek performance you also have to guarantee that the base case of your recursion is coarse enough to avoid substantial overhead from recursive function calls. Other than that, I would suggest profiling your code so you have a good hint of which parts are really worth optimizing.
It took some figuring out, which is a bit embarrassing, since once you see it, the answer is so simple.
As it stands in the question, the program doesn't work correctly; instead, on some runs it randomly duplicates some numbers and loses others. This appears to be a purely parallel error, one that doesn't arise when running the program with the variable thread_count == 1.
The pragma "parallel sections" is a combined parallel and sections directive, which in this case means that it starts a second parallel region inside the previous one. Parallel regions inside other parallel regions are fine, but I think most implementations don't give you extra threads when they encounter a nested parallel region.
The fix is to replace
#pragma omp parallel sections
with
#pragma omp sections
After this fix, the program starts to give correct answers, and on a two-core system with a million numbers I get the following timings.
One thread:
time taken: 0.378794
Two threads:
time taken: 0.203178
Since the sections directive only has two sections, I think you won't get any benefit from spawning more than two threads in the parallel clause, so change num_threads(thread_count) -> num_threads(2).
But because at least the two implementations I tried are not able to spawn new threads for nested parallel regions, the program as it stands doesn't scale to more than two threads.

Resources