OpenMP parallel Fibonacci program is slower than sequential - C

I have this sequential code:
int fib(int n) {
    int x, y;
    if (n < 2)
        return n;
    x = fib(n-1);
    y = fib(n-2);
    return x + y;
}
And this parallel code:
int fib(int n) {
    int x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x)
    x = fib(n-1);
    #pragma omp task shared(y)
    y = fib(n-2);
    #pragma omp taskwait
    return x + y;
}
The OpenMP parallel code is slower than the serial version. I use TDM-GCC 7.4, and no other programs are running while the Fibonacci benchmark executes. What's wrong?
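Two things are usually behind this. First, task directives only spread work across threads when they execute inside a parallel region, typically entered through a single construct; called from plain serial code, every task is pure overhead. Second, near the leaves each call does only a few instructions of work, far less than the cost of creating a task, so a serial cutoff is needed. A minimal sketch of both fixes (the cutoff of 20 and the helper fib_serial are illustrative choices, not from the original post):

#include <stdio.h>
#include <omp.h>

/* Plain serial recursion, used below the cutoff. */
static int fib_serial(int n) {
    if (n < 2)
        return n;
    return fib_serial(n - 1) + fib_serial(n - 2);
}

int fib(int n) {
    int x, y;
    if (n < 20)  /* hypothetical cutoff: below this, skip task creation */
        return fib_serial(n);
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

int main(void) {
    int result;
    /* Tasks are only distributed when created inside a parallel
       region; single lets one thread start the recursion while
       the whole team executes the generated tasks. */
    #pragma omp parallel
    #pragma omp single
    result = fib(35);
    printf("fib(35) = %d\n", result);
    return 0;
}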

Related

Counter inside nested loops with OpenMP

I use counters in my C program for making a raster image, to gather image statistics: the relative size of each basin (by counting pixels).
My program is long, so I have made a minimal, reproducible example.
First, a program without OpenMP which shows what I want to achieve (the total number of pixels):
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int x;
    int xMax = 100;
    int y;
    int yMax = 100;
    int i = 0;
    for (x = 0; x < xMax; x++) {
        for (y = 0; y < yMax; y++)
            { i++; }
    }
    printf("i = %d \t xMax*yMax = %d\n", i, xMax*yMax);
    return 0;
}
It counts the pixels (x,y) properly:
i = xMax*yMax
When I add OpenMP it is not so easy any more, but a reduction helps:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h> // OpenMP

int i = 0;

int main()
{
    int x;
    int xMax = 1000;
    int y;
    int yMax = 1000;
    int all = xMax*yMax;
    #pragma omp parallel for collapse(2) schedule(dynamic) reduction(+:i)
    for (x = 0; x < xMax; x++) {
        for (y = 0; y < yMax; y++)
            { i++; }
    }
    printf("i = %d = %f*(xMax*yMax) \t where xMax*yMax = %d\n", i, (double)i/all, all);
    return 0;
}
When I hide the counter inside another function, the counter is no longer updated properly:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h> // OpenMP

int i = 0;

void P() {
    i++;
}

int main()
{
    int x;
    int xMax = 1000;
    int y;
    int yMax = 1000;
    int all = xMax*yMax;
    #pragma omp parallel for collapse(2) schedule(dynamic) reduction(+:i)
    for (x = 0; x < xMax; x++) {
        for (y = 0; y < yMax; y++)
            { P(); }
    }
    printf("i = %d = %f*(xMax*yMax) \t where xMax*yMax = %d\n", i, (double)i/all, all);
    return 0;
}
Now:
gcc p.c -Wall -fopenmp
./a.out
i = 437534 = 0.437534*(xMax*yMax)    where xMax*yMax = 1000000
Problem: the counter inside the function is not updated properly.
Question: what should I change to update the counter properly?
The problem is that the reduction(+:i) clause creates a private copy of i inside the parallel loop, and it is this copy you are supposed to update. In your code, however, you increment the global one by calling function P, which is not thread-safe (the increment is a race condition). So you just have to make sure that P increments the local i:
void P(int *i) {
    (*i)++;
}

// in main:
for (y = 0; y < yMax; y++)
    { P(&i); }
Another (but slower) option is to make function P thread-safe with an atomic operation. In this case you do not need the reduction at all:

void P() {
    #pragma omp atomic
    i++;
}

OpenMP Parallelize Pi program

I have been trying to parallelize the following code using OpenMP, with no success.
I have searched the internet for examples, yet none of them gives me the same answer after executing the program several times.
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

long num_steps = 100000;
double step = 1.0/100000.0;

int main() {
    int i;
    double x, pi, sum = 0.0;
    for (i = 0; i < num_steps; ++i) {
        x = (i-0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
    pi = step*sum;
    printf("PI value = %f\n", pi);
}
This is the solution I have so far:
int main (int argc, char **argv){
    //Variables
    int i = 0, aux = 0;
    double step = 1.0/100000.0;
    double x = 0.0,
           pi = 0.0,
           sum = 0.0;
    #pragma omp parallel shared(sum,i) private(x)
    {
        x = 0.0;
        sum = 0.0;
        #pragma omp for
        for (i = 0; i < num_steps; ++i) {
            x = (i-0.5)*step;
            #pragma omp critical
            sum += 4.0/(1.0+x*x);
        }
    }
    /* All threads join master thread and terminate */
    pi = step*sum;
    printf("PI value = %f\n", pi);
}
Please consider using the same construct for your loop as described on the official OpenMP website: loop parallelism. I had to change many lines in your code; I hope it will be a starting point for you to get more familiar with OpenMP and loop parallelism in C.
#include <stdio.h>
#include <omp.h>
#define NUM_STEPS 10000000

int main (int argc, char **argv){
    //Variables
    long int i, num_steps = NUM_STEPS;
    double x, step, sum, pi;
    sum = 0.0;
    step = 1.0 / (double) num_steps;
    #pragma omp parallel private(i,x)
    {
        #pragma omp for reduction(+:sum)
        for (i = 0; i < num_steps; ++i) {
            x = (i+0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
    }
    /* All threads join master thread and terminate */
    pi = step*sum;
    printf("PI value = %.24f\n", pi);
    return 0;
}
The answer was:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

long num_steps = 100000;
double step = 1.0/100000.0;

int main() {
    int i;
    double x, pi, sum = 0.0;
    #pragma omp parallel private(x)
    {
        #pragma omp for reduction(+:sum)
        for (i = 0; i < num_steps; ++i) {
            x = (i-0.5)*step;
            sum += 4.0/(1.0+x*x);
        }
    }
    pi = step*sum;
    printf("PI value = %f\n", pi);
}
Your main problem is that you declare your loop index i as shared. This leads every thread to use the same i in the evaluation. What you actually want OpenMP to do is divide the whole range of i into chunks and assign a different chunk to each thread. So declare your i as private.
Apart from this, you don't need to re-initialize x and sum in the parallel region. After fixing some unrelated compilation errors, your code should look like this:
#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 2

int main (int argc, char **argv){
    //Variables
    int i = 0, aux = 0;
    double step = 1.0/100000.0;
    long num_steps = 100000;
    double x = 0.0,
           pi = 0.0,
           sum = 0.0;
    #pragma omp parallel shared(sum) private(i,x)
    {
        #pragma omp for
        for (i = 0; i < num_steps; ++i) {
            x = (i-0.5)*step;
            #pragma omp critical
            sum += 4.0/(1.0+x*x);
        }
    }
    /* All threads join master thread and terminate */
    pi = step*sum;
    printf("PI value = %f\n", pi);
}
Keep in mind that this is far from ideal in terms of performance, since every update of sum inside the critical section serializes the threads. A first step to make your code faster is to remove the critical section and declare sum as a reduction instead:
#pragma omp parallel private(i,x)
{
    #pragma omp for reduction(+:sum)
    for (i = 0; i < num_steps; ++i) {
        x = (i-0.5)*step;
        sum += 4.0/(1.0+x*x);
    }
}

Multithreaded recursive fibonacci with OpenMP Tasks

I know Fibonacci is fundamentally sequential, but I just want to test OpenMP tasks on the recursive implementation of the Fibonacci series. The following C code works correctly, but my problem is that instead of getting faster results with more threads, it gets slower. Why? You can try it yourself. I want the best scalability.
Compile this code with "gcc -O3 -fopenmp -o fib fib.c" and run it.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double serialFib(int n, double* a) {
    if (n < 2) {
        a[n] = n;
        return n;
    }
    double x = serialFib(n - 1, a);
    double y = serialFib(n - 2, a);
    a[n] = x + y;
    return x + y;
}

double fib(int n, double* a) {
    if (n < 2) {
        a[n] = n;
        return n;
    }
    if (n <= 30) { // avoid task creation overhead
        return serialFib(n, a);
    }
    double x, y;
    #pragma omp task shared(x, a) firstprivate(n)
    {
        x = fib(n - 1, a);
    }
    #pragma omp task shared(y, a) firstprivate(n)
    {
        y = fib(n - 2, a);
    }
    #pragma omp taskwait
    a[n] = x + y;
    return x + y;
}

int main(int argc, char *argv[]) {
    double t0, t1;
    // To test the scalability of the recursive approach
    // we take N = 40. Otherwise it would take too long.
    int N = 40, i, nthreads;
    printf("Starting benchmark...\n");
    nthreads = atoi(argv[1]);
    omp_set_num_threads(nthreads);
    // N+1 elements: fib(N, a) stores into a[N]
    double* a = (double *) calloc(N + 1, sizeof(double));
    t0 = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(N, a);
        }
    }
    t1 = omp_get_wtime();
    for (i = 0; i <= N; ++i) {
        printf("a[%d] = %.2f\n", i, a[i]);
    }
    printf("Execution time: %f\n", t1 - t0);
    free(a);
    return 0;
}
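A side note, offered as an assumption rather than a verified diagnosis: every task stores into the shared array a, and because fib(n-1) internally recomputes fib(n-2), sibling tasks write the same and neighboring elements of a concurrently, which is both a data race and a source of cache-line traffic. A variant without the array (hypothetical name fib2; call it as fib2(N) from the same parallel/single harness) isolates the pure task overhead and may be a fairer scalability test:

// Task-based Fibonacci without the shared result array.
// The cutoff of 30 matches the original code.
long fib2(int n) {
    if (n < 2)
        return n;
    if (n <= 30)                       // stay serial below the cutoff
        return fib2(n - 1) + fib2(n - 2);
    long x, y;
    #pragma omp task shared(x) firstprivate(n)
    x = fib2(n - 1);
    #pragma omp task shared(y) firstprivate(n)
    y = fib2(n - 2);
    #pragma omp taskwait
    return x + y;
}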

bad speedup on simple OpenMP saxpy

I am having trouble getting a simple SAXPY program to scale its performance decently using OpenMP.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char** argv){
    int N = atoi(argv[1]), threads = atoi(argv[2]), i;
    omp_set_num_threads(threads);
    double a = 3.141592, *x, *y, t1, t2;
    x = (double*)malloc(sizeof(double)*N);
    y = (double*)malloc(sizeof(double)*N);
    for(i = 0; i < N; ++i){
        x[i] = y[i] = (double)i;
    }
    t1 = omp_get_wtime();
    #pragma omp parallel default(none) private(i) shared(a, N, x,y)
    for(i = 0; i < N; ++i){
        y[i] = a*x[i] + y[i];
    }
    t2 = omp_get_wtime();
    printf("%f secs\n", t2-t1);
}
I am compiling as:
gcc main.c -lm -O3 -fopenmp -o prog
And the performance I get for 10M elements is:
threads = 1 0.015097 secs
threads = 2 0.013954 secs
Any idea what the problem is?
You forgot the for in your #pragma omp directive:
#pragma omp parallel for default(none) private(i) shared(a, N, x,y)
Without the for there is no work-sharing: each thread iterates over the full range [0, N) by itself.
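Even with the work-sharing fixed, near-linear speedup is unlikely here (a general caveat rather than a diagnosis of this particular machine): SAXPY performs one multiply-add per three array accesses, so with 10M doubles the loop is typically bound by memory bandwidth, not by arithmetic. A sketch that repeats the kernel to smooth out timing noise, assuming the corrected directive; REPS is an illustrative choice:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define REPS 20 /* repeat the kernel to amortize timing noise */

int main(int argc, char** argv){
    int N = atoi(argv[1]), threads = atoi(argv[2]), i, r;
    omp_set_num_threads(threads);
    double a = 3.141592, *x, *y, t1, t2;
    x = (double*)malloc(sizeof(double)*N);
    y = (double*)malloc(sizeof(double)*N);
    for (i = 0; i < N; ++i)
        x[i] = y[i] = (double)i;
    t1 = omp_get_wtime();
    for (r = 0; r < REPS; ++r){
        #pragma omp parallel for default(none) private(i) shared(a, N, x, y)
        for (i = 0; i < N; ++i)
            y[i] = a*x[i] + y[i];
    }
    t2 = omp_get_wtime();
    printf("%f secs per sweep\n", (t2-t1)/REPS);
    free(x); free(y);
    return 0;
}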

not getting the correct sum - openmp

When I run this code I get 2542199.979500 as the answer, but the correct answer is 1271099.989750. Could someone please tell me where the error is?
This is the code which contains the bug:
#include <omp.h>
#define N 1000

main ()
{
    int i, nthreads;
    int chunk = 10;
    float a[N], b[N], c[N], d[N];
    double result;
    #pragma omp parallel
    {
        nthreads = omp_get_num_threads();
        printf("no of threads %d", nthreads);
        #pragma for shared(a,b,c,d,result) private(i) schedule(static,chunk)
        for (i = 0; i < N; i++){
            a[i] = i * 1.5;
            b[i] = i + 22.35;
        }
        #pragma for shared(a,b,c,d,result) private(i) schedule(static,chunk)
        for (i = 0; i < N; i++){
            result = result + (a[i]+b[i]);
        }
    }
    printf("value is %f", result);
}
Furthermore, when the number of threads is 3 I get
3813299.969250
The result depends on the number of threads used. Could this be a bug in OpenMP, or am I doing something wrong?
I suggest at least the following two changes...
For the declaration of result...

    // result should be initialized
    double result = 0;

For your final pragma...

    // specify the "reduction"
    #pragma omp parallel for reduction(+:result)

Without specifying the reduction, the summation into result is invalid, since result would be modified independently by each thread, resulting in a race condition.
See http://en.wikipedia.org/wiki/OpenMP#Reduction
#include <stdio.h>
#include <omp.h>
#define N 1000

int main ()
{
    int i, nthreads;
    int chunk = 10;
    float a[N], b[N], c[N], d[N];
    double result = 0;
    #pragma omp parallel
    nthreads = omp_get_num_threads();
    printf("no of threads %d\n", nthreads);
    #pragma omp parallel for
    for (i = 0; i < N; i++){
        a[i] = i * 1.5;
        b[i] = i + 22.35;
    }
    #pragma omp parallel for reduction(+:result)
    for (i = 0; i < N; i++){
        result = result + (a[i]+b[i]);
    }
    printf("value is %f", result);
    return 0;
}
Please see comments inline.
// openmp.c
#include <stdio.h>
#include <omp.h>
#define N 1000

// main should return an int
int main(){
    int i, nthreads;
    float a[N], b[N];
    // give result an initial value!
    double result = 0;
    #pragma omp parallel
    {
        nthreads = omp_get_num_threads();
        // print numthreads only ONCE
        #pragma omp single
        printf("no. of threads %d\n", nthreads);
        #pragma omp for
        for (int i = 0; i < N; i++) {
            a[i] = i * 1.5;
            b[i] = i + 22.35;
        }
        #pragma omp for
        for (int i = 0; i < N; i++) {
            double sum = a[i] + b[i];
            // atomic operation needed!
            #pragma omp atomic
            result += sum;
        }
        #pragma omp single
        printf("result = %f\n", result);
    }
    return 0;
}
Compile using cc -fopenmp -std=gnu99 openmp.c; the output is:
no. of threads 4
result = 1271099.989750
In OpenMP one should try to minimise the number of parallel regions; in this case a single one is possible and hence enough. Here is a simple C++ version doing just that.
#include <iostream>
#include <iomanip>
#include <omp.h>

const int N = 1000;

int main ()
{
    const double A = 22.35;
    const double B = 1.5;
    double a[N], b[N], c[N], d[N];
    double result = 0;
    #pragma omp parallel
    { // begin parallel region
        #pragma omp master
        std::cout << "no of threads: " << omp_get_num_threads() << std::endl;
        // this loop and the following could be merged and the arrays avoided.
        #pragma omp for
        for (int i = 0; i < N; ++i) {
            a[i] = i * B;
            b[i] = i + A;
        }
        #pragma omp for reduction(+:result)
        for (int i = 0; i < N; ++i)
            result += a[i]+b[i];
    } // end parallel region
    double answer = N*(A+0.5*(B+1)*(N-1));
    std::cout << "computed result = " << std::setprecision(16) << result
              << '\n'
              << "correct answer = " << std::setprecision(16) << answer
              << std::endl;
    return 0;
}
I get (using gcc 4.6.2 on Mac OS X 10.6.8):
no of threads: 2
computed result = 1271099.999999993
correct answer = 1271100
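For reference, the closed form used for answer can be checked by hand: each term is a[i] + b[i] = (B+1)*i + A, so summing over i = 0..N-1 gives result = A*N + (B+1)*N*(N-1)/2 = N*(A + 0.5*(B+1)*(N-1)). With A = 22.35, B = 1.5 and N = 1000 this is 1000*(22.35 + 1248.75) = 1271100, matching the output above.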
