I use counters in my C program for making a raster image, to see image statistics: the relative size of each basin (by pixel counting).
My program is long so I have made Minimal, Reproducible Example.
First program without OpenMP, which shows what I want to achieve (counting all pixels):
#include <stdio.h>
#include <stdlib.h>
/* Serial reference version: visit every pixel of a 100x100 raster and
 * count the visits, then print the counter next to xMax*yMax so it is
 * easy to see that they match. */
int main()
{
int xMax = 100;
int yMax = 100;
int i = 0;
for (int x = 0; x < xMax; x++)
for (int y = 0; y < yMax; y++)
i++;
printf("i = %d \t xMax*yMax = %d\n", i, xMax*yMax);
return 0;
}
It counts pixels(x,y) properly:
i = xMax*yMax
When I add OpenMP then it is not so easy, but reduction helps
#include <stdio.h>
#include <stdlib.h>
#include <omp.h> // OpenMP
int i= 0; // global pixel counter; reduction(+:i) gives each thread a private copy and sums them at the end
int main()
{
int x;
int xMax = 1000;
int y;
int yMax = 1000;
int all = xMax*yMax; // expected total pixel count
// collapse(2) fuses the two perfectly nested loops into one iteration space;
// reduction(+:i) makes the increments race-free
#pragma omp parallel for collapse(2) schedule(dynamic) reduction(+:i)
for (x = 0; x < xMax; x++) {
for (y = 0; y < yMax; y++)
{i++;}
}
printf("i = %d = %f*(xMax*yMax) \t where xMax*yMax = %d\n", i, (double)i/all, all);
return 0;
}
When I hide counter inside another function then counter is not updated properly
#include <stdio.h>
#include <stdlib.h>
#include <omp.h> // OpenMP
int i= 0; // global counter
// NOTE(review): P() increments the GLOBAL i. Inside the parallel loop below,
// reduction(+:i) gives each thread a private i, but this function still
// touches the global one without synchronization -- a data race, which is
// why the final count comes out too low.
void P(){
i++;
}
int main()
{
int x;
int xMax = 1000;
int y;
int yMax = 1000;
int all = xMax*yMax;
// reduction(+:i) only redirects accesses to i that are visible in this
// lexical scope; the call to P() bypasses the private copy (see NOTE above).
#pragma omp parallel for collapse(2) schedule(dynamic) reduction(+:i)
for (x = 0; x < xMax; x++) {
for (y = 0; y < yMax; y++)
{P();}
}
printf("i = %d = %f*(xMax*yMax) \t where xMax*yMax = %d\n", i, (double)i/all, all);
return 0;
}
Now :
gcc p.c -Wall -fopenmp
./a.out
i = 437534 = 0.437534*(xMax*yMax) where xMax*yMax = 1000000
Problem : counter inside function is not updated properly
Question : What should I change to update counter properly ?
The problem is that reduction(+:i) clause creates a local variable i, and you are supposed to change this local variable. In your code, however, you increment the global one by calling function P, which is not thread safe (it has a race condition when incrementing it). So, you just have to make sure that you increment the local i when calling function P:
/* Pass the counter explicitly so that P increments the thread-private
 * copy created by reduction(+:i).
 * FIX: the original suggestion used a C++ reference (int&), but the
 * question's file is compiled as C ("gcc p.c"), where references do not
 * exist -- use a pointer instead and call P(&i). */
void P(int* i){
(*i)++;
}
//in main:
for (y = 0; y < yMax; y++)
{P(&i);}
Another (but slower) option is to make function P threadsafe by using an atomic operation. In this case you do not need reduction at all:
// Thread-safe alternative: make each increment of the GLOBAL i atomic.
// No reduction clause is needed then, but every increment is serialized
// through the atomic, so this is slower than the reduction-based fix.
void P(){
#pragma omp atomic
i++;
}
Related
I want to create 2 matrices and fill them with random numbers 0-9.
I just don't understand why my function doesn't work like this.
If I define a and b with e.g. `#define a 3` it works.
So the problem occurs at:
void fillM(int array[a][b])
and
void printM(int array[a][b])
Original code
#include <stdio.h>
#include <float.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>
//fill random
// NOTE(review): this does not compile as posted -- `a` and `b` are
// undeclared identifiers here; the array extents must themselves be
// function parameters so they are in scope for the VLA parameter type.
void fillM(int array[a][b]) {
for (int x=0; x < a; x++) {
for (int y=0; y < b; y++) {
array[x][y] = rand()%10; // each cell gets a pseudo-random digit 0..9
}
}
}
//print
// NOTE(review): same problem as fillM -- `a` and `b` are undeclared here.
// Additionally, array[a][b] is out of bounds (valid indices run 0..a-1 and
// 0..b-1); the element meant to be printed is array[x][y].
void printM(int array[a][b]){
for (int x=0; x < a; x++){
for (int y=0; y < b; y++) {
printf("%d ", array[a][b]); // BUG: prints one out-of-range value repeatedly
}
printf("\n");
}
}
// NOTE(review): C programs start at `main` (lowercase); `Main` is never
// called, and the program will not link without a real main().
int Main(){
//do I really need this?
// (yes -- srand() seeds rand(); without it every run repeats the same sequence)
srand (time(NULL));
//set size
int n;
printf("please set size of n x n Matrix: \n");
scanf("%d", &n); // NOTE(review): return value unchecked; n is not validated
int m = n;
//initialise
int m1[n][m]; // C99 variable-length arrays on the stack
int m2[n][m];
fillM (m1); // NOTE(review): the sizes are not passed, so fillM cannot know them
fillM (m2);
return 0;
}
Revised code
#include <float.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>
#include <stdio.h>
//fill random
/* Fill the a-by-b matrix with pseudo-random digits in [0, 9].
 * The extents are passed first so they are in scope for the VLA parameter.
 * FIX: loop indices are size_t to match the extents -- the int counters
 * mixed signed/unsigned comparison and would overflow for huge matrices. */
void fillM(size_t a, size_t b, int array[a][b]) {
for (size_t x=0; x < a; x++) {
for (size_t y=0; y < b; y++) {
array[x][y] = rand()%10;
}
}
}
//print
/* Print the a-by-b matrix row by row, one blank line after the matrix.
 * FIX: print array[x][y], the current element. The posted code printed
 * array[a][b], which is far out of bounds (indices run 0..a-1 / 0..b-1)
 * and therefore showed one garbage value in every position -- this is
 * also why "the same number" appeared everywhere. */
void printM(size_t a, size_t b, int array[a][b]){
for (size_t x=0; x < a; x++){
for (size_t y=0; y < b; y++) {
printf("%d ", array[x][y]);
}
printf("\n");
}
printf("\n");
}
/* Read a matrix size, build two random n-by-n matrices, and print them. */
int main(){
srand (time(NULL)); // seed once so each run produces a different sequence
//set size
int n;
printf("please set size of n x n Matrix: \n");
// Robustness fix: reject non-numeric or non-positive input before using n
// as a VLA extent (a zero or negative extent is undefined behaviour).
if (scanf("%d", &n) != 1 || n <= 0) {
printf("invalid size\n");
return 1;
}
printf("\n");
int m = n;
//initialise
int m1[n][m];
int m2[n][m];
fillM (n, m, m1);
fillM (n, m, m2);
printM (n, m, m1);
printM (n, m, m2);
return 0;
}
But one more question. If I run the program now, it doesn't fill the matrix with random numbers everywhere. It puts the same random number in every place. Do you know how to fix this?
At the point where you use a and b, they are not defined. You need something more like:
void fill(size_t a, size_t b, int array[a][b]
Your calls will pass the array size as well.
In your revised code, you get the same answer for every printed value because you attempt to print the same element of the array, array[a][b] — except that it isn't an element of the array but is a long way out of bounds because the array indexes run from 0..a-1 and 0..b-1. Use array[x][y] instead.
I am trying to integrate a function of curve, and convert the serial code to parallel program, I am using openMP for the same.
I have parallelized the for loop using openMP parallel for and have achieved lesser program time, but the problem is the result is not the expected one, there is something which get messed up in the threads, I want to know how to parallelize the for loop for N number of threads.
#include <stdio.h>
#include <omp.h>
#include <math.h>
double f(double x){
return sin(x)+0.5*x;
}
int main(){
int n=134217728,i; // number of trapezoid sub-intervals (2^27)
double a=0,b=9,h,x,sum=0,integral;
double start = omp_get_wtime();
h=fabs(b-a)/n; // width of one sub-interval
omp_set_dynamic(0); // forbid the runtime from shrinking the team
omp_set_num_threads(64);
// NOTE(review): this is the source of the wrong result -- x is declared
// shared, so every thread overwrites the same x between computing it and
// using it (a data race). x must be private, or the expression inlined
// as sum += f(a+i*h). sum itself is handled correctly by the reduction.
#pragma omp parallel for reduction (+:sum) shared(x)
for(i=1;i<n;i++){
x=a+i*h;
sum=sum+f(x);
}
integral=(h/2)*(f(a)+f(b)+2*sum); // composite trapezoidal rule
double end = omp_get_wtime();
double time = end - start;
printf("Execution time: %2.3f seconds\n",time);
printf("\nThe integral is: %lf\n",integral);
}
The expected output is 22.161130 but it is getting varied each time the program is ran.
The loop you are trying to parallelise modifies the same variables x and sum in each iteration, this is very cumbersome to parallelize.
You could rewrite the code to make the path to parallelisation more obvious:
#include <stdio.h>
#include <omp.h>
#include <math.h>
double f(double x) {
return sin(x) + 0.5 * x;
}
/* Trapezoidal integration of f over [a, b]: thread j accumulates the
 * sample points j, j+64, j+128, ... into its own slot sums[j]. */
int main() {
int n = 1 << 27, j;
double a = 0, b = 9, h, sum, integral;
// One partial sum per thread.
// NOTE: adjacent sums[] elements share cache lines (false sharing), which
// limits the speed-up; padding the slots or using a reduction avoids that.
double sums[64] = { 0 };
double start = omp_get_wtime();
h = fabs(b - a) / n;
omp_set_dynamic(0);
omp_set_num_threads(64);
#pragma omp parallel for
for (j = 0; j < 64; j++) {
// FIX: i is declared inside the loop; as an outer shared int it was
// raced on by all threads.
for (int i = 0; i < n; i += 64) {
if (i + j == 0) continue; // trapezoidal rule sums interior points only (index >= 1)
sums[j] += f(a + i * h + j * h);
}
}
sum = 0;
for (j = 0; j < 64; j++) {
sum += sums[j]; // FIX: was sums[i], which summed the wrong slots
}
integral = (h / 2) * (f(a) + f(b) + 2 * sum);
double end = omp_get_wtime();
double time = end - start;
printf("Execution time: %2.3f seconds\n", time);
printf("\nThe integral is: %lf\n", integral);
return 0;
}
I know fibonacci is fundamentally sequential. But I just want to test OpenMP Tasks for the recursive implementation of fibonacci series. The following code in C works fine but my Problem is, instead of getting faster results with more threads, it gets worse. Why? You can try it on your self. I want best scalability.
Compile this code with "gcc -O3 -fopenmp -o fib fib.c" and run it.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
/* Plain recursive Fibonacci used below the task cut-off.
 * Records every computed value in a[k] (for k <= n) and returns fib(n).
 * Doubles are used so large Fibonacci numbers do not overflow. */
double serialFib(int n, double* a) {
if (n < 2) {
a[n] = n;
return n;
}
double prev = serialFib(n - 1, a);
double prevPrev = serialFib(n - 2, a);
double result = prev + prevPrev;
a[n] = result;
return result;
}
/* Task-parallel Fibonacci: spawns one OpenMP task per recursive call above
 * the cut-off, falls back to serialFib below it, and records every value
 * in a[n]. Must be invoked from inside a parallel region (see the single
 * construct in main). Returns fib(n). */
double fib(int n, double* a) {
if (n < 2) {
a[n] = n;
return n;
}
if (n <= 30) { // avoid task creation overhead
return serialFib(n, a);
}
double x, y;
// x and y are written by the child tasks, so they must be shared;
// n is firstprivate so each task captures its value at creation time.
#pragma omp task shared(x, a) firstprivate(n)
{
x = fib(n - 1, a);
}
#pragma omp task shared(y, a) firstprivate(n)
{
y = fib(n - 2, a);
}
#pragma omp taskwait // both children must finish before x + y is read
a[n] = x + y;
return x + y;
}
/* Benchmark driver: reads the thread count from argv[1], computes fib(N)
 * with the task-based version, and prints the table plus the elapsed time. */
int main(int argc, char *argv[]) {
double t0, t1;
// To test scalability of recursive approach
// we take N = 40. Otherwise it will take too long.
int N = 40, i, nthreads;
printf("Starting benchmark...\n");
// Robustness fix: argv[1] used to be dereferenced without checking argc,
// crashing when the argument was missing.
if (argc < 2 || (nthreads = atoi(argv[1])) < 1) {
fprintf(stderr, "usage: %s <nthreads>\n", argv[0]);
return 1;
}
omp_set_num_threads(nthreads);
// FIX: fib(N, a) stores its result in a[N], so N+1 slots are required;
// calloc(N, ...) made that final store a heap buffer overflow.
double* a = (double *) calloc(N + 1, sizeof(double));
if (a == NULL) {
return 1;
}
t0 = omp_get_wtime();
#pragma omp parallel
{
#pragma omp single // one thread seeds the task tree; the team executes the tasks
{
fib(N, a);
}
}
t1 = omp_get_wtime();
for (i = 0; i <= N; ++i) { // include a[N], the final result
printf("a[%d] = %.2f\n", i, a[i]);
}
printf("Execution time: %f\n", t1 - t0);
free(a);
return 0;
}
I tried this code
/*main.c*/
#include <stdio.h> /* printf */
#include <math.h> /* sqrt */
int frequency_of_primes (int n) {
int i, j;
int freq = n - 1;
for (i = 2; i <= n; ++i)
for (j = sqrt(i); j > 1; --j)
if (i%j==0) {--freq; break;}
return freq;
}
int main() {
// sqrt of a compile-time constant: the compiler folds this to 2.0, so no
// call to the libm sqrt() survives into the object file and no -lm is
// needed for this line alone.
printf("%f\n", sqrt(4.0));
return 0;
}
and compiled it with `gcc main.c`; it reported `undefined reference to 'sqrt'`. I already know that adding the `-lm` option can resolve this issue. But what really surprises me is this:
#include <stdio.h> /* printf */
#include <math.h> /* sqrt */
// int frequency_of_primes (int n) {
// int i, j;
// int freq = n - 1;
// for (i = 2; i <= n; ++i)
// for (j = sqrt(i); j > 1; --j)
// if (i%j==0) {--freq; break;}
// return freq;
// }
int main() {
// Still links without -lm: the optimizer evaluates sqrt(4.0) at compile
// time (constant argument), so the object file contains no reference to
// the libm sqrt symbol.
printf("%f\n", sqrt(4.0));
return 0;
}
The main function also calls sqrt, but ld doesn't report any errors.
That's because the optimizer is handling the constant case you're using.
It's the sqrt(i) call inside frequency_of_primes() that's the problem, the call in main() is optimized out. You can figure that out by reading the generated code for the latter case, it'll just load a constant 2.0 and be done with it.
When I run this code I am getting 2542199.979500 as the answer. However, the correct answer is 1271099.989750. Could someone please tell me where the error is?
This is the code which contains the bug:
#include <omp.h>
#define N 1000
main ()
{
int i, nthreads;
int chunk = 10;
float a[N], b[N], c[N], d[N];
double result; // NOTE(review): never initialized -- the sum starts from garbage
#pragma omp parallel
{
nthreads = omp_get_num_threads();
printf("no of threads %d", nthreads); // printed once per thread, not once
// NOTE(review): "#pragma for" is not an OpenMP directive (missing "omp"),
// so it is silently ignored and EVERY thread runs the whole loop.
#pragma for shared(a,b,c,d,result) private(i) schedule(static,chunk)
for (i=0; i < N; i++){
a[i] = i * 1.5;
b[i] = i + 22.35;
}
// NOTE(review): ignored as well; all threads add every element into the
// shared `result` with no synchronization -- a data race, and the reason
// the total scales with the number of threads.
#pragma for shared(a,b,c,d,result) private(i) schedule(static,chunk)
for(i=0; i < N; i++){
result = result + (a[i]+b[i]);
}
}
printf("value is %f", result);
}
Furthermore, when the number of threads is 3 I get
3813299.969250
The result depends on the number of threads used. Could this be a bug in openmp, or am I doing something wrong?
I suggest at least the following two changes...
for the declaration of result...
// result should be initialized
double result = 0;
For your final pragma...
// specify the "reduction"
#pragma omp parallel for reduction(+:result)
Without specifying the "reduction", the summation to result is invalid since result would be modified independently in each thread -- resulting in a race condition.
See http://en.wikipedia.org/wiki/OpenMP#Reduction
#include <stdio.h>
#include <omp.h>
#define N 1000
/* Fill a[] and b[] in parallel, then sum a[i]+b[i] with a reduction. */
int main ()
{
int i, nthreads;
float a[N], b[N]; // c[], d[] and chunk were unused and have been removed
double result=0; // must start at 0: it is the reduction accumulator
// Query the team size; every thread writes the same value here, but doing
// it inside a "single" construct would avoid the concurrent writes entirely.
#pragma omp parallel
nthreads = omp_get_num_threads();
printf("no of threads %d\n", nthreads);
// Independent iterations; the loop variable is private per the for construct.
#pragma omp parallel for
for (i=0; i < N; i++){
a[i] = i * 1.5;
b[i] = i + 22.35;
}
// reduction(+:result) gives each thread a private partial sum and combines
// them at the end -- no race on result.
#pragma omp parallel for reduction(+:result)
for(i=0; i < N; i++){
result = result + (a[i]+b[i]);
}
printf("value is %f", result);
return 0;
}
Please see comments inline.
// openmp.c
#include <stdio.h>
#include <omp.h>
#define N 1000
// main should return a int
int main(){
int i, nthreads; // NOTE(review): this outer i is unused; the loops declare their own
float a[N], b[N];
// give result a initial value !
double result = 0;
#pragma omp parallel
{
nthreads = omp_get_num_threads();
// just print numthreads ONCE
#pragma omp single
printf("no. of threads %d\n", nthreads);
// worksharing loop inside the already-open parallel region
#pragma omp for
for (int i = 0; i < N; i++) {
a[i] = i *1.5;
b[i] = i + 22.35;
}
// the implicit barrier after the loop above guarantees a[]/b[] are complete
#pragma omp for
for (int i = 0; i < N; i++) {
double sum = a[i] + b[i];
// atomic operation needed !
#pragma omp atomic
result += sum;
}
// implicit barrier of the loop above makes result final before printing
#pragma omp single
printf("result = %f\n", result);
}
return 0;
}
Compile using cc -fopenmp -std=gnu99 openmp.c, the output is:
no. of threads 4
result = 1271099.989750
In OpenMP one should try to minimise the number of parallel regions; in this case a single one is possible and hence sufficient. Here is a simple C++ version doing just that.
#include <iostream>
#include <iomanip>
#include <omp.h>
const int N=1000;
/* Single parallel region: fill a[] and b[] with one worksharing loop,
 * reduce their element-wise sum with another, then compare against the
 * closed-form answer. */
int main ()
{
const double A = 22.35;
const double B = 1.5;
double a[N], b[N]; // c[] and d[] were declared but never used -- removed
double result=0;
#pragma omp parallel
{ // begin parallel region
#pragma omp master
std::cout << "no of threads: " << omp_get_num_threads() << std::endl;
// this loop and the following could be merged and the arrays avoided.
#pragma omp for
for(int i=0; i<N; ++i) {
a[i] = i * B;
b[i] = i + A;
}
// each thread accumulates a private partial sum; combined at region end
#pragma omp for reduction(+:result)
for(int i=0; i<N; ++i)
result += a[i]+b[i];
} // end parallel region
// closed form of sum_{i=0}^{N-1} (i*B + i + A), used as the reference value
double answer = N*(A+0.5*(B+1)*(N-1));
std::cout << "computed result = " << std::setprecision(16) << result
<< '\n'
<< "correct answer = " << std::setprecision(16) << answer
<< std::endl;
return 0;
}
I get (using gcc 4.6.2 on Mac OS X 10.6.8):
no of threads: 2
computed result = 1271099.999999993
correct answer = 1271100