I am learning OpenMP and found the following code to solve N-Queens problem. I want to make faster solution with parallel code, but I don't know how to do it. Everytime I tried something, it is only go to worse and slower solution.
This is code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define BOARD_SIZE 15
int board[20],solutions;
int main() {
double start_time, end_time, result_time;
void queen(int row,int n);
start_time = omp_get_wtime();
queen(1, BOARD_SIZE);
end_time = omp_get_wtime();
result_time = end_time - start_time;
printf("Time = %.3f seconds\n", result_time);
printf("Number of solutions: %d\n", solutions);
return 0;
}
/*funtion to check conflicts
If no conflict for desired postion returns 1 otherwise returns 0*/
int place(int row, int column) {
for(int i=1;i<=row-1;++i) {
//checking column and diagonal conflicts
if(board[i]==column)
return 0;
else
if(abs(board[i]-column)==abs(i-row))
return 0;
}
return 1; //no conflicts
}
//function to check for proper positioning of queen
void queen(int row,int n) {
for(int column=1;column<=n;++column) {
if(place(row,column)) {
board[row]=column; //no conflicts so place queen
if(row==n) //dead end
solutions++;
else //try queen with next position
queen(row+1,n);
}
}
}
What can I do for optimize this code?
To parallelize your code with OpenMP I did the following:
I have added #pragma omp parallel and #pragma omp single before the first call of function queen.
To avoid data race I added #pragma omp atomic before solutions++; (Note that I have also tried using reduction, but in that case tasks have to be synchronized, which makes it a slower option).
To make parallel recursive calls I had to create a (private) copy of the board (and to pass its pointer to function queen). To create tasks I used #pragma omp task if(row<4). Note that if(row<4) is used to limit the number of tasks created.
The measured speedup is about 6x on 8 cores.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define BOARD_SIZE 15
int solutions;
int main() {
double start_time, end_time, result_time;
int board[BOARD_SIZE+1];
void queen(int row,int n, int*);
start_time = omp_get_wtime();
#pragma omp parallel
#pragma omp single
queen(1, BOARD_SIZE, board);
end_time = omp_get_wtime();
result_time = end_time - start_time;
printf("Time = %.3f seconds\n", result_time);
printf("Number of solutions: %d\n", solutions);
return 0;
}
/*funtion to check conflicts
If no conflict for desired postion returns 1 otherwise returns 0*/
int place(int row, int column, int* board) {
for(int i=1;i<=row-1;++i) {
//checking column and diagonal conflicts
if(board[i]==column)
return 0;
else
if(abs(board[i]-column)==abs(i-row))
return 0;
}
return 1; //no conflicts
}
//function to check for proper positioning of queen
void queen(int row,int n, int* board) {
for(int column=1;column<=n;++column) {
if(place(row,column, board)) {
if(row==n) //dead end
{
#pragma omp atomic
solutions++;
}
else
//try queen with next position
{
int local_board[BOARD_SIZE+1];
for(int i=1;i<row;i++) local_board[i]=board[i]; //copy board to local_board
local_board[row]=column; //no conflicts so place queen
#pragma omp task if(row<4)
queen(row+1,n, local_board);
}
}
}
}
Please indicate if you need more explanation.
I see no omp directives in your code.
As Jerome suggested, the way to do this is with tasks.
I coded this a while ago. Download my textbook and see the chapter with OMP examples (or html version. The solution uses taskloop and omp cancel taskloop.
Unfortunately, you get no speedup: the sequential code finds the first solution very fast, and going parallel doesn't make it faster.
Oh, sorry, this uses C++. I'm sure the translation to C is simple enough. But C++ is a better language these days.
Related
I am trying to parallelize with OpenMP a recursive function with this form:
void RecursiveFunction(double *array, int size)
{
if(n > 1)
{
//Serial code
//0 < index < n
RecursiveFunction(array, index);
RecursiveFunction(array, n - index-1);
}
}
int main()
{
//Serial code
RecursiveFunction(array, size);
//Serial code
}
The compiler (gcc) throws an error when more than a certain number of threads are created (between 800 and 900), stating that "libgomp: Thread creation failed: Resource temporarily unavailable". So I am trying to set a fixed number of threads, using omp_set_num_threads(fixedThreadNumber), so that when, while recursing, more than the maximum number of threads are created, the system must wait for a thread to be freed in order to make another recursion.
Apart from setting the max thread number with omp_set_num_threads, I also tried setting omp_set_dynamic to 0, omp_set_nested to 1 and modifying the recursive function to this form:
int main()
{
//Serial code
omp_set_num_threads(numberOfThreads);
omp_set_dynamic(0);
omp_set_nested(1);
RecursiveFunction(array, size);
//Serial code
}
void RecursiveFunction(double *array, int size)
{
if(n > 1)
{
//Serial code
//0 < index < n
#pragma omp parallel sections
{
#pragma omp section
{
RecursiveFunction(array, index);
}
#pragma omp section
{
RecursiveFunction(array, n-index-1);
}
}
}
}
But no matter what I tried, threads will either be created dynamically and will lead to the aforementioned error, or only two threads will be created and every recursion will use these two threads.
I would be grateful if someone could point me in the right direction.
I'm trying to do a parallel for inside a while, somothing like this:
while(!End){
for(...;...;...) // the parallel for
...
// serial code
}
The for loop is the only parallel section of the while loop. If I do this, I have a lot of overhead:
cycles = 0;
while(!End){ // 1k Million iterations aprox
#pragma omp parallel for
for(i=0;i<N;i++) // the parallel for with 256 iteration aprox
if(time[i] == cycles){
if (wbusy[i]){
wbusy[i] = 0;
wfinished[i] = 1;
}
}
// serial code
++cycles;
}
Each iteration of the for loop are indepent with each other.
There are dependencies between serial code and parallel code.
So normally one doesn't have to worry too much about putting parallel regions into loops, as modern openmp implementations are pretty efficient about using things like thread teams and as long as there's lots of work in the loop you're fine. But here, with an outer loop count of ~1e9 and an inner loop count of ~256 - and very little work being done per iteration - the overhead is likely comparable to or worse than the amount of work being done and performance will suffer.
So there will be a noticeable difference between this:
cycles = 0;
while(!End){ // 1k Million iterations aprox
#pragma omp parallel for
for(i=0;i<N;i++) // the parallel for with 256 iteration aprox
if(time[i] == cycles){
if (wbusy[i]){
wbusy[i] = 0;
wfinished[i] = 1;
}
}
// serial code
++cycles;
}
and this:
cycles = 0;
#pragma omp parallel
while(!End){ // 1k Million iterations aprox
#pragma omp for
for(i=0;i<N;i++) // the parallel for with 256 iteration aprox
if(time[i] == cycles){
if (wbusy[i]){
wbusy[i] = 0;
wfinished[i] = 1;
}
}
// serial code
#pragma omp single
{
++cycles;
}
}
But really, that scan across the time array every iteration is unfortunately both (a) slow and (b) not enough work to keep multiple cores busy - it's memory intensive. With more than a couple of threads you will actually have worse performance than serial, even without overheads, just because of memory contention. Admittedly what you have posted here is just an example, not your real code, but why don't you preprocess the time array so you can just check to see when the next task is ready to update:
#include <stdio.h>
#include <stdlib.h>
struct tasktime_t {
long int time;
int task;
};
int stime_compare(const void *a, const void *b) {
return ((struct tasktime_t *)a)->time - ((struct tasktime_t *)b)->time;
}
int main(int argc, char **argv) {
const int n=256;
const long int niters = 100000000l;
long int time[n];
int wbusy[n];
int wfinished[n];
for (int i=0; i<n; i++) {
time[i] = rand() % niters;
wbusy[i] = 1;
wfinished[i] = 0;
}
struct tasktime_t stimes[n];
for (int i=0; i<n; i++) {
stimes[i].time = time[i];
stimes[i].task = i;
}
qsort(stimes, n, sizeof(struct tasktime_t), stime_compare);
long int cycles = 0;
int next = 0;
while(cycles < niters){ // 1k Million iterations aprox
while ( (next < n) && (stimes[next].time == cycles) ) {
int i = stimes[next].task;
if (wbusy[i]){
wbusy[i] = 0;
wfinished[i] = 1;
}
next++;
}
++cycles;
}
return 0;
}
This is ~5 times faster than the serial version of the scanning approach (and much faster than the OpenMP versions). Even if you are constantly updating the time/wbusy/wfinished arrays in the serial code, you can keep track of their completion times using a priority queue with each update taking O(ln(N)) time instead of scanning every iteration taking O(N) time.
I'd like get to know OpenMP a bit, cause I'd like to have a huge loop parallelized. After some reading (SO, Common OMP mistakes, tutorial, etc), I've taken as a first step the basically working c/mex code given below (which yields different results for the first test case).
The first test does sum up result values - functions serial, parallel -,
the second takes values from an input array and writes the processed values to an output array - functions serial_a, parallel_a.
My questions are:
Why differ the results of the first test, i. e. the results of the serial and parallel
Suprisingly the second test succeeds. My concern is about, how to handle memory (array locations) which possibly are read by multiple threads? In the example this should be emulated by a[i])/cos(a[n-i].
Are there some easy rules how to determine which variables to declare as private, shared and reduction?
In both cases int i is outside the pragma, however the second test appears to yield correct results. So is that okay or has i to be moved into the pragma omp parallel region, as being said here?
Any other hints on spoted mistakes?
Code
#include "mex.h"
#include <math.h>
#include <omp.h>
#include <time.h>
double serial(int x)
{
double sum=0;
int i;
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
return sum;
}
double parallel(int x)
{
double sum=0;
int i;
#pragma omp parallel num_threads(6) shared(sum) //default(none)
{
//printf(" I'm thread no. %d\n", omp_get_thread_num());
#pragma omp for private(i, x) reduction(+: sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
return sum;
}
void serial_a(double* a, int n, double* y2)
{
int i;
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
void parallel_a(double* a, int n, double* y2)
{
int i;
#pragma omp parallel num_threads(6)
{
#pragma omp for private(i)
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
}
}
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
{
double sum, *y1, *y2, *a, s, p;
int x, n, *d;
/* Check for proper number of arguments. */
if(nrhs!=2) {
mexErrMsgTxt("Two inputs required.");
} else if(nlhs>2) {
mexErrMsgTxt("Too many output arguments.");
}
/* Get pointer to first input */
x = (int)mxGetScalar(prhs[0]);
/* Get pointer to second input */
a = mxGetPr(prhs[1]);
d = (int*)mxGetDimensions(prhs[1]);
n = (int)d[1]; // row vector
/* Create space for output */
plhs[0] = mxCreateDoubleMatrix(2,1, mxREAL);
plhs[1] = mxCreateDoubleMatrix(n,2, mxREAL);
/* Get pointer to output array */
y1 = mxGetPr(plhs[0]);
y2 = mxGetPr(plhs[1]);
{ /* Do the calculation */
clock_t tic = clock();
y1[0] = serial(x);
s = (double) clock()-tic;
printf("serial....: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
y1[1] = parallel(x);
p = (double) clock()-tic;
printf("parallel..: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
mexEvalString("drawnow");
tic = clock();
serial_a(a, n, y2);
s = (double) clock()-tic;
printf("serial_a..: %.0f ms\n", s);
mexEvalString("drawnow");
tic = clock();
parallel_a(a, n, &y2[n]);
p = (double) clock()-tic;
printf("parallel_a: %.0f ms\n", p);
printf("ratio.....: %.2f \n", p/s);
}
}
Output
>> mex omp1.c
>> [a, b] = omp1(1e8, 1:1e8);
serial....: 13399 ms
parallel..: 2810 ms
ratio.....: 0.21
serial_a..: 12840 ms
parallel_a: 2740 ms
ratio.....: 0.21
>> a(1) == a(2)
ans =
0
>> all(b(:,1) == b(:,2))
ans =
1
System
MATLAB Version: 8.0.0.783 (R2012b)
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Microsoft Visual Studio 2005 Version 8.0.50727.867
In your function parallel you have a few mistakes. The reduction should be declared when you use parallel. Private and share variables should also be declared when you use parallel. But when you do a reduction you should not declare the variable that is being reduced as shared. The reduction will take care of this.
To know what to declare private or shared you have to ask yourself which variables are being written to. If a variable is not being written to then normally you want it to be shared. In your case the variable x does not change so you should declare it shared. The variable i, however, does change so normally you should declare it private so to fix your function you could do
#pragma omp parallel reduction(+:sum) private(i) shared(x)
{
#pragma omp for
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
}
However, OpenMP automatically makes the iterator of a parallel for region private and variables declared outside of parallel regions are shared by default so for your parallel function you can simply do
#pragma omp parallel for reduction(+:sum)
for(i = 0; i<x; i++){
sum += sin(x*i) / cos(x*i+1.0);
}
Notice that the only difference between this and your serial code is the pragma statment. OpenMP is designed so that you don't have to change your code except for pragma statments.
When it comes to arrays as long as each iteration of a parallel for loop acts on a different array element then you don't have to worry about shared and private. So you can write your private_a function simply as
#pragma omp parallel for
for(i = 0; i<n; i++){
y2[i] = sin(a[i]) / cos(a[n-i]+1.0);
}
and once again it is the same as your serial_a function except for the pragma statement.
But be careful with assuming iterators are private. Consider the following double loop
for(i=0; i<n; i++) {
for(j=0; j<m; j++) {
//
}
}
If you use #pragma parallel for with that the i iterator will be made private but the j iterator will be shared. This is because the parallel for only applies to the outer loop over i and since j is shared by default it is not made private. In this case you would need to explicitly declare j private like this #pragma parallel for private(j).
I've just started looking into OpenMP and have been reading up on tasks. It seems that Sun's example here is actually slower than the sequential version. I believe this has to do with the overhead of task creation and management. Is this correct? If so, is there a way to make the code faster using tasks without changing the algorithm?
int fib(int n)
{
int i, j;
if (n<2)
return n;
else
{
#pragma omp task shared(i) firstprivate(n)
i=fib(n-1);
#pragma omp task shared(j) firstprivate(n)
j=fib(n-2);
#pragma omp taskwait
return i+j;
}
}
The only sensible way is to cut the parallelism at a certain level, below which it makes no sense as the overhead becomes larger than the work being done. The best way to do it is to have two separate fib implementations - a serial one, let's say fib_ser, and a parallel one, fib - and to switch between them under a given threshold.
int fib_ser(int n)
{
if (n < 2)
return n;
else
return fib_ser(n-1) + fib_ser(n-2);
}
int fib(int n)
{
int i, j;
if (n <= 20)
return fib_ser(n);
else
{
#pragma omp task shared(i)
i = fib(n-1);
#pragma omp task shared(j)
j = fib(n-2);
#pragma omp taskwait
return i+j;
}
}
Here the threshold is n == 20. It was chosen arbitrarily and its optimal value would be different on different machines and different OpenMP runtimes.
Another option would be to dynamically control the tasking with the if clause:
int fib(int n)
{
int i, j;
if (n<2)
return n;
else
{
#pragma omp task shared(i) if(n > 20)
i=fib(n-1);
#pragma omp task shared(j) if(n > 20)
j=fib(n-2);
#pragma omp taskwait
return i+j;
}
}
This would switch off the two explicit tasks if n <= 20, and the code would execute in serial, but there would still be some overhead from the OpenMP code transformation, so it would run slower than the previous version with the separate serial implementation.
You are correct, the slow down has to do with the overhead of task creation and management.
Adding the if(n > 20) is a great way to prune the task tree, and even more optimization can be made by not creating the second task. When the new tasks get spawned, the parent thread does nothing. The code can be sped up by allowing the parent task to take care of one of the fib calls instead of creating more tasks:
int fib(int n){
int i, j;
if (n < 2)
return n;
else{
#pragma omp task shared(i) if(n > 20)
i=fib(n-1);
j=fib(n-2);
#pragma omp taskwait
return i+j;
}
}
I have implemented a parallel code in C for merge sort using OPENMP. I get speed up of 3.9 seconds which is quite slower that the sequential version of the same code(for which i get 3.6). I am trying to optimise the code to the best possible state but cant increase the speedup. Can you please help out with this? Thanks.
void partition(int arr[],int arr1[],int low,int high,int thread_count)
{
int tid,mid;
#pragma omp if
if(low<high)
{
if(thread_count==1)
{
mid=(low+high)/2;
partition(arr,arr1,low,mid,thread_count);
partition(arr,arr1,mid+1,high,thread_count);
sort(arr,arr1,low,mid,high);
}
else
{
#pragma omp parallel num_threads(thread_count)
{
mid=(low+high)/2;
#pragma omp parallel sections
{
#pragma omp section
{
partition(arr,arr1,low,mid,thread_count/2);
}
#pragma omp section
{
partition(arr,arr1,mid+1,high,thread_count/2);
}
}
}
sort(arr,arr1,low,mid,high);
}
}
}
As was correctly noted, there are several mistakes in your code that prevent its correct execution, so I would first suggest to review these errors.
Anyhow, taking into account only how OpenMP performance scales with thread, maybe an implementation based on task directives would fit better as it overcomes the limits already pointed by a previous answer:
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause
You can find a trace of such an implementation below:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
void getTime(double *t) {
struct timeval tv;
gettimeofday(&tv, 0);
*t = tv.tv_sec + (tv.tv_usec * 1e-6);
}
int compare( const void * pa, const void * pb ) {
const int a = *((const int*) pa);
const int b = *((const int*) pb);
return (a-b);
}
void merge(int * array, int * workspace, int low, int mid, int high) {
int i = low;
int j = mid + 1;
int l = low;
while( (l <= mid) && (j <= high) ) {
if( array[l] <= array[j] ) {
workspace[i] = array[l];
l++;
} else {
workspace[i] = array[j];
j++;
}
i++;
}
if (l > mid) {
for(int k=j; k <= high; k++) {
workspace[i]=array[k];
i++;
}
} else {
for(int k=l; k <= mid; k++) {
workspace[i]=array[k];
i++;
}
}
for(int k=low; k <= high; k++) {
array[k] = workspace[k];
}
}
void mergesort_impl(int array[],int workspace[],int low,int high) {
const int threshold = 1000000;
if( high - low > threshold ) {
int mid = (low+high)/2;
/* Recursively sort on halves */
#ifdef _OPENMP
#pragma omp task
#endif
mergesort_impl(array,workspace,low,mid);
#ifdef _OPENMP
#pragma omp task
#endif
mergesort_impl(array,workspace,mid+1,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
/* Merge the two sorted halves */
#ifdef _OPENMP
#pragma omp task
#endif
merge(array,workspace,low,mid,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
} else if (high - low > 0) {
/* Coarsen the base case */
qsort(&array[low],high-low+1,sizeof(int),compare);
}
}
void mergesort(int array[],int workspace[],int low,int high) {
#ifdef _OPENMP
#pragma omp parallel
#endif
{
#ifdef _OPENMP
#pragma omp single nowait
#endif
mergesort_impl(array,workspace,low,high);
}
}
const size_t largest = 100000000;
const size_t length = 10000000;
int main(int argc, char *argv[]) {
int * array = NULL;
int * workspace = NULL;
double start,end;
printf("Largest random number generated: %d \n",RAND_MAX);
printf("Largest random number after truncation: %d \n",largest);
printf("Array size: %d \n",length);
/* Allocate and initialize random vector */
array = (int*) malloc(length*sizeof(int));
workspace = (int*) malloc(length*sizeof(int));
for( int ii = 0; ii < length; ii++)
array[ii] = rand()%largest;
/* Sort */
getTime(&start);
mergesort(array,workspace,0,length-1);
getTime(&end);
printf("Elapsed time sorting: %g sec.\n", end-start);
/* Check result */
for( int ii = 1; ii < length; ii++) {
if( array[ii] < array[ii-1] ) printf("Error:\n%d %d\n%d %d\n",ii-1,array[ii-1],ii,array[ii]);
}
free(array);
free(workspace);
return 0;
}
Notice that if you seek performances you also have to guarantee that the base case of your recursion is coarse enough to avoid substantial overhead due to recursive function calls. Other than that, I would suggest to profile your code so you can have a good hint on which parts are really worth optimizing.
It took some figuring out, which is a bit embarassing, since when you see it, the answer is so simple.
As it stands in the question, the program doesn't work correctly, instead it randomly on some runs duplicates some numbers and loses others. This appears to be a totally parallel error, that doesn't arise when running the program with the variable thread_count == 1.
The pragma "parallel sections", is a combined parallel and sections directive, which in this case means, that it starts a second parallel region inside the previous one. Parallel regions inside other parallel regions are fine, but I think most implementation don't give you extra threads when they encounter a nested parallel region.
The fix is to replace
#pragma omp parallel sections
with
#pragma omp sections
After this fix, the program starts to give correct answers, and with a two core system and for a million numbers I get for timing the following results.
One thread:
time taken: 0.378794
Two threads:
time taken: 0.203178
Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause, so change num_threads(thread_count) -> num_threads(2)
But because of the fact that at least the two implementations I tried are not able to spawn new threads for nested parallel regions, the program as it stands doesn't scale to more than two threads.