I am trying to parallelize this recursive function with OpenMP:
#include <stdio.h>
#include <omp.h>

void rec(int from, int to)
{
    int len = to - from;
    printf("%X %x %X %d\n", from, to, len, omp_get_thread_num());
    if (len > 1) {
        int mid = (from + to) / 2;
        #pragma omp task
        rec(from, mid);
        #pragma omp task
        rec(mid, to);
    }
}

int main(int argc, char *argv[])
{
    long len = 1024;
    #pragma omp parallel
    #pragma omp single
    rec(0, len);
    return 0;
}
But when I run it I get a segfault:
$g++ -fopenmp -Wall -pedantic -lefence -g -O0 test.cpp && ./a.out
0 400 400 0
0 200 200 1
200 400 200 0
Segmentation fault
When I run it under valgrind it shows no errors. Without -lefence it also works.
I tried all possible combinations of #pragma omp clauses, and it either runs single-threaded or segfaults.
What is wrong?
Thanks a lot.
According to the OpenACC documentation:
copyin - Create space for the listed variables on the device, initialize the variable by copying
data to the device at the beginning of the region, and release the space on the device when
done without copying the data back to the host.
I've created a test program:
#include <stdio.h>

int main(int argc, char** argv)
{
    int teste[] = { -15 };
    #pragma acc data copyin(teste[0:1])
    {
        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d", teste[0]);
    return 0;
}
According to the docs, the program should output -15, since the data is modified on the device and the result is not copied back to the host. But once I compile and run this code, the output is 4999.
My compiler is gcc (tdm64-1) 10.3.0, and I'm running the program on a computer with separate device and host memory.
I'd like to know why this is not working, and what I could do to prevent the copy from the device back to the host.
Here's the program running using git bash on windows:
$ cat test.c && echo "" && gcc -fopenacc test.c && ./a.exe
#include <stdio.h>
int main(int argc, char** argv)
{
    int teste[] = { -15 };
    #pragma acc data copyin(teste[0:1])
    {
        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d\n", teste[0]);
    return 0;
}
4999
I also got access to a Linux machine, and even using nvc I could not get the correct results:
$ cat test.c && echo "" && /opt/nvidia/hpc_sdk/Linux_x86_64/2021/compilers/bin/nvc -acc -Minfo=accel test.c && ./a.out
#include <stdio.h>
int main(int argc, char** argv)
{
    int teste[] = { -15 };
    #pragma acc data copyin(teste[0:1])
    {
        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d\n", teste[0]);
    return 0;
}
main:
      9, Generating copyin(teste[:]) [if not already present]
         Generating NVIDIA GPU code
     12, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
4999
The program should print -15 since the value isn't changed on the host. Hence this is either a bug in gcc or you're not actually enabling OpenACC. What compiler flags are you using?
Here's the output using nvc targeting an NVIDIA A100:
% cat test.c
#include <stdio.h>
int main(int argc, char** argv)
{
    int teste[] = { -15 };
    #pragma acc data copyin(teste[0:1])
    {
        #pragma acc parallel loop
        for (int p = 0; p < 5000; p++) {
            teste[0] = p;
        }
    }
    printf("%d\n", teste[0]);
    return 0;
}
% nvc test.c -acc -Minfo=accel ; a.out
main:
     10, Generating copyin(teste[:]) [if not already present]
         Generating NVIDIA GPU code
     13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
-15
I keep getting this error (for >6 hours now) when trying to compile C code with the -fopenmp flag using gcc:
error: invalid controlling predicate
for ( int i = 0; i < N; i++ )
I browsed Stack Overflow and stripped down my code to the point where it is an exact copy of an example from an OpenMP handbook, but it still doesn't compile.
#include <stdio.h>
#include <math.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(int argc, char *argv[])
{
    double N;
    sscanf(argv[1], " %lf", &N);
    double integral = 0.0;
    #pragma omp parallel for reduction(+: integral)
    for (int i = 0; i < N; i++)
        integral = integral + i;
    printf("%20.18lf\n", integral);
    return 0;
}
Any suggestions..?
Found it, sorry for the clutter.
To all other C newbies like myself: the error was in the double N. OpenMP wants your loop to run up to an INTEGER N, not a double.
I installed gcc-7, gcc-8, gcc-7-offload-nvptx and gcc-8-offload-nvptx.
With both compilers I tried to build a simple OpenMP code with offloading:
#include <omp.h>
#include <stdio.h>

int main(){
    #pragma omp target
    #pragma omp teams distribute parallel for
    for (int i = 0; i < omp_get_num_threads(); i++)
        printf("%d in %d of %d\n", i, omp_get_thread_num(), omp_get_num_threads());
}
Compiling with the following line (with gcc-7 too):
gcc-8 code.c -fopenmp -foffload=nvptx-none
But the build fails with the following linker error:
/tmp/ccKESWcF.o: In function "main":
teste.c:(.text+0x50): undefined reference to "GOMP_target_ext"
/tmp/cc0iOH1Y.target.o: In function "init":
ccPXyu6Y.c:(.text+0x1d): undefined reference to "GOMP_offload_register_ver"
/tmp/cc0iOH1Y.target.o: In function "fini":
ccPXyu6Y.c:(.text+0x41): undefined reference to "GOMP_offload_unregister_ver"
collect2: error: ld returned 1 exit status
Any clues?
Your code compiles and runs for me using -foffload=disable -fno-stack-protector with gcc7 and gcc-7-offload-nvptx on Ubuntu 17.10.
But with offloading enabled (without -foffload=disable) it fails to compile: you can't call printf from the GPU. Instead you can do this:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(){
    int nthreads;
    #pragma omp target teams map(tofrom:nthreads)
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();

    int *ithreads = malloc(sizeof(*ithreads) * nthreads);
    #pragma omp target teams distribute parallel for map(tofrom:ithreads[0:nthreads])
    for (int i = 0; i < nthreads; i++)
        ithreads[i] = omp_get_thread_num();

    for (int i = 0; i < nthreads; i++)
        printf("%d in %d of %d\n", i, ithreads[i], nthreads);
    free(ithreads);
}
For me this outputs
0 in 0 of 8
1 in 0 of 8
2 in 0 of 8
3 in 0 of 8
4 in 0 of 8
5 in 0 of 8
6 in 0 of 8
7 in 0 of 8
I wrote the following code, then compiled and ran it. A segmentation fault occurs when calling mpf_set_si, but I can't understand why.
OS: Mac OS X 10.9.2
Compiler: i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
#include <stdio.h>
#include <gmp.h>
#include <math.h>
#ifdef OMP
#include <omp.h>
#endif

#define NUM_ITTR 1000000

int
main(void)
{
    unsigned long int i, begin, end, perTh;
    mpf_t pi, gbQuaterPi, quaterPi, pw, tmp;
    int tn, nt;

    mpf_init(quaterPi);
    mpf_init(gbQuaterPi);
    mpf_init(pw);
    mpf_init(tmp);
    mpf_init(pi);

    #pragma omp parallel private(tmp, pw, quaterPi, tn, begin, end, i)
    {
#ifdef OMP
        tn = omp_get_thread_num();
        nt = omp_get_num_threads();
        perTh = NUM_ITTR / nt;
        begin = perTh * tn;
        end = begin + perTh - 1;
#else
        begin = 0;
        end = NUM_ITTR - 1;
#endif
        for (i = begin; i <= end; i++) {
            printf("Before set begin=%lu %lu tn= %d\n", begin, end, tn);
            mpf_set_si(tmp, -1);  /* segmentation fault occurs here */
            printf("After set begin=%lu %lu tn= %d\n", begin, end, tn);
            mpf_pow_ui(pw, tmp, i);
            mpf_set_si(tmp, 2);
            mpf_mul_ui(tmp, tmp, i);
            mpf_add_ui(tmp, tmp, 1);
            mpf_div(tmp, pw, tmp);
            mpf_add(quaterPi, quaterPi, tmp);
        }
        #pragma omp critical
        {
            mpf_add(gbQuaterPi, gbQuaterPi, quaterPi);
        }
    }
    mpf_mul_ui(pi, gbQuaterPi, 4);
    gmp_printf("pi= %.30Ff\n", pi);

    mpf_clear(pi);
    mpf_clear(tmp);
    mpf_clear(pw);
    mpf_clear(quaterPi);
    mpf_clear(gbQuaterPi);
    return 0;
}
-Command line-
$ setenv OMP_NUM_THREADS 2
$ gcc -g -DOMP -I/opt/local/include -fopenmp -o calcpi calcpi.c -lgmp -L/opt/local/lib
$ ./calcpi
Before set begin=0 499999 tn= 0
Before set begin=500000 999999 tn= 1
After set begin=1 999999 tn= 1
Segmentation fault
private variables are not initialised, so they can hold any value at the start of the parallel section. Initialising them inside the parallel block can work, but often isn't efficient.
Usually a better way is to use firstprivate instead of private, which initialises each thread's copy with the value the variable had before the parallel region.
What I am looking for is the best way to gather all the data from the parallel for loops into one variable. OpenMP seems to take a different approach than I am used to, as I started learning Open MPI first, which has scatter and gather routines.
Calculating PI (embarrassingly parallel routine)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_STEPS 100
#define CHUNKSIZE 20

int main(int argc, char *argv[])
{
    double step, x, pi, sum = 0.0;
    int i, chunk;

    chunk = CHUNKSIZE;
    step = 1.0/(double)NUM_STEPS;

    #pragma omp parallel shared(chunk) private(i,x,sum,step)
    {
        #pragma omp for schedule(dynamic,chunk)
        for (i = 0; i < NUM_STEPS; i++)
        {
            x = (i+0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
            printf("Thread %d: i = %i sum = %f \n", omp_get_thread_num(), i, sum);
        }
        pi = step * sum;
    }
    return 0;
}
EDIT: It seems that I could use an array sum[NUM_STEPS / CHUNKSIZE] and sum the array into one value, or would it be better to use some sort of blocking routine to sum the result of each iteration?
Add this clause to your #pragma omp parallel ... statement:
reduction(+ : pi)
Then just do pi += step * sum; at the end of the parallel region. (Notice the plus!) OpenMP will then automagically sum up the partial sums for you.
Let's see. I am not quite sure what happens, because I haven't got deterministic behaviour on the finished application, but I have something that resembles π. I removed the #pragma omp parallel shared(chunk) and changed the #pragma omp for schedule(dynamic,chunk) to:
#pragma omp parallel for schedule(dynamic) reduction(+:sum)
This requires some explanation: I removed the schedule's chunk just to make it all simpler (for me). The part you are interested in is reduction(+:sum), which is a normal reduce operation with the operator + and the variable sum.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_STEPS 100

int main(int argc, char *argv[])
{
    double step, pi, sum = 0.0;
    int i;

    step = 1.0/(double)NUM_STEPS;

    #pragma omp parallel for schedule(dynamic) reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++)
    {
        double x = (i+0.5)*step;
        sum += 4.0/(1.0+x*x);
        printf("Thread %d: i = %i sum = %f \n", omp_get_thread_num(), i, sum);
    }
    pi = step * sum;
    printf("pi=%lf\n", pi);
    return 0;
}