I wrote the following code, compiled it, and ran the program. A segmentation fault occurs when calling mpf_set_si, but I can't understand why.
OS: Mac OS X 10.9.2
Compiler: i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
#include <stdio.h>
#include <gmp.h>
#include <math.h>
#ifdef OMP
#include <omp.h>   /* for omp_get_thread_num/omp_get_num_threads */
#endif

#define NUM_ITTR 1000000

int
main(void)
{
    unsigned long int i, begin, end, perTh;
    mpf_t pi, gbQuaterPi, quaterPi, pw, tmp;
    int tn, nt;

    mpf_init(quaterPi);
    mpf_init(gbQuaterPi);
    mpf_init(pw);
    mpf_init(tmp);
    mpf_init(pi);

    #pragma omp parallel private(tmp, pw, quaterPi, tn, begin, end, i)
    {
#ifdef OMP
        tn = omp_get_thread_num();
        nt = omp_get_num_threads();
        perTh = NUM_ITTR / nt;
        begin = perTh * tn;
        end = begin + perTh - 1;
#else
        begin = 0;
        end = NUM_ITTR - 1;
#endif
        for (i = begin; i <= end; i++) {
            printf("Before set begin=%lu %lu tn= %d\n", begin, end, tn);
            mpf_set_si(tmp, -1);    /* segmentation fault occurs here */
            printf("After set begin=%lu %lu tn= %d\n", begin, end, tn);
            mpf_pow_ui(pw, tmp, i);
            mpf_set_si(tmp, 2);
            mpf_mul_ui(tmp, tmp, i);
            mpf_add_ui(tmp, tmp, 1);
            mpf_div(tmp, pw, tmp);
            mpf_add(quaterPi, quaterPi, tmp);
        }
        #pragma omp critical
        {
            mpf_add(gbQuaterPi, gbQuaterPi, quaterPi);
        }
    }
    mpf_mul_ui(pi, gbQuaterPi, 4);
    gmp_printf("pi= %.30Ff\n", pi);

    mpf_clear(pi);
    mpf_clear(tmp);
    mpf_clear(pw);
    mpf_clear(quaterPi);
    mpf_clear(gbQuaterPi);
    return 0;
}
-Command line-
$ setenv OMP_NUM_THREADS 2
$ gcc -g -DOMP -I/opt/local/include -fopenmp -o calcpi calcpi.c -lgmp -L/opt/local/lib
$ ./calcpi
Before set begin=0 499999 tn= 0
Before set begin=500000 999999 tn= 1
After set begin=1 999999 tn= 1
Segmentation fault
private variables are not initialised, so they hold indeterminate values at the start of the parallel section. For the mpf_t variables here that means garbage internal pointers, which is why mpf_set_si crashes. Initialising each private copy inside the parallel block can work, but often isn't efficient.
Usually a better way is to use firstprivate instead of private, which will initialise each variable with the value it had before the parallel region.
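For the GMP types here, "initialising inside the block" concretely means calling mpf_init on each private copy before first use and mpf_clear before the region ends, since a private mpf_t starts out as a raw, uninitialised struct. A minimal, untested sketch of that approach against the program above:

    #pragma omp parallel private(tmp, pw, quaterPi, tn, begin, end, i)
    {
        /* The private copies of tmp, pw and quaterPi are raw mpf_t structs;
           init them before first use (mpf_init also sets the value to 0). */
        mpf_init(tmp);
        mpf_init(pw);
        mpf_init(quaterPi);

        /* ... the existing loop over i ... */

        #pragma omp critical
        {
            mpf_add(gbQuaterPi, gbQuaterPi, quaterPi);
        }

        /* Free the per-thread copies before the region ends. */
        mpf_clear(tmp);
        mpf_clear(pw);
        mpf_clear(quaterPi);
    }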
Edit: solved! Windows limits the stack size to less than my buffer needs; Linux does not (additionally, I was accessing memory outside of my array... oops). Using GCC, you can set the stack size like so: gcc -Wl,--stack,N [your other flags n stuff], where N is the size of the stack in bytes. Final working compile command: gcc -Wl,--stack,8000000 -fopenmp openmp.c -o openmp
An interesting side note is that rand() seems to produce shorter patterns on Windows than on Linux: I can see patterns (tiling) in the generated noise on Windows, but not on Linux. As always, if you need it to be absolutely random, use a cryptographically secure random function.
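As a sketch of that last point (my own illustration, assuming a POSIX system with /dev/urandom; on Windows the equivalent would be something like BCryptGenRandom, and secure_random_bytes is a made-up helper name):

    #include <stdio.h>
    #include <stddef.h>

    /* Fill buf with len cryptographically secure random bytes.
       Returns 0 on success, -1 on failure. */
    static int secure_random_bytes(unsigned char *buf, size_t len)
    {
        FILE *f = fopen("/dev/urandom", "rb");
        if (f == NULL)
            return -1;
        size_t got = fread(buf, 1, len, f);
        fclose(f);
        return got == len ? 0 : -1;
    }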
Pre edit:
This piece of code is supposed to fill a screen buffer with random noise, then write that to a file. It works on Linux (Ubuntu 19) but not on Windows (8.1).
The error message:
Unhandled exception at 0x0000000000413C46 in openmp.exe:
0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000043D50).
0000000000413C46 or qword ptr [rcx],0
// gcc -fopenmp openmp.c -o openmp
// ./openmp
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    int w = 1920;
    int h = 1080;
    int thread_id, nloops;
    unsigned char buffer[w][h][3]; // 1920 x 1080 pixels, 3 channels

    printf("Did setup\n");

    #pragma omp parallel private(thread_id, nloops)
    {
        nloops = 0;
        thread_id = omp_get_thread_num();
        printf("Thread %d started\n", thread_id);

        #pragma omp for
        for (int x = 0; x < w; x++){
            for (int y = 0; y < h; y++){
                nloops++;
                unsigned char r = rand();
                unsigned char g = rand();
                unsigned char b = rand();
                buffer[x][y][0] = r;
                buffer[x][y][1] = g;
                buffer[x][y][2] = b;
            }
        }
        printf("Thread %d performed %d iterations of the loop.\n", thread_id, nloops);
    }

    FILE* image = fopen("render.ppm","w");
    fprintf(image, "P3\n%d %d\n%d\n", w, h, 255);
    for (int x = 0; x < w; x++){
        for (int y = 0; y < h-1; y++){
            fprintf(image, "%d %d %d ", buffer[x][y][0], buffer[x][y][1], buffer[x][y][2]);
        }
        fprintf(image, "%d %d %d\n", buffer[w][h][0], buffer[w][h][1], buffer[w][h][2]);
    }
    printf("%fmb\n", ((float)sizeof(buffer))/1000000);
    return 0;
}
The local buffer variable wants 1920 * 1080 * 3 (6,220,800) bytes of space. This is more than the default stack size of a Windows application (1 MB).
If you were using the Microsoft tools, you could use the /STACK linker option to specify a larger stack.
With the GCC toolchain, you can use the -Wl,--stack,8000000 option to set a larger stack size.
Or you can dynamically allocate space for buffer using malloc.
A third alternative is to use the editbin tool to specify the size after the executable is built.
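For the malloc route, a minimal sketch (assuming a C99 compiler, so a pointer to a variable-length array can keep the existing buffer[x][y][c] indexing unchanged):

    /* Heap allocation sidesteps the stack-size limit entirely. */
    unsigned char (*buffer)[h][3] = malloc(sizeof(unsigned char[w][h][3]));
    if (buffer == NULL) {
        perror("malloc");
        return 1;
    }
    /* ... fill and write the buffer exactly as before ... */
    free(buffer);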
In
fprintf(image, "%d %d %d\n", buffer[w][h][0], buffer[w][h][1], buffer[w][h][2]);
you are accessing buffer out of bounds. The highest valid indices for buffer are w - 1 and h - 1:
fprintf(image, "%d %d %d\n", buffer[w - 1][h - 1][0], buffer[w - 1][h - 1][1], buffer[w - 1][h - 1][2]);
I'm using the "read" benchmark from Why is writing to memory much slower than reading it?, and I added just two lines:
#pragma omp parallel for
for(unsigned dummy = 0; dummy < 1; ++dummy)
They should have no effect, because OpenMP should only parallelize the outer loop, yet the code now consistently runs twice as fast.
Update: These lines aren't even necessary. Simply adding
omp_get_num_threads();
(implicitly declared) in the same place has the same effect.
Complete code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

unsigned long do_xor(const unsigned long* p, unsigned long n)
{
    unsigned long i, x = 0;
    for(i = 0; i < n; ++i)
        x ^= p[i];
    return x;
}

int main()
{
    unsigned long n, r, i;
    unsigned long *p;
    clock_t c0, c1;
    double elapsed;

    n = 1000 * 1000 * 1000; /* GB */
    r = 100; /* repeat */
    p = calloc(n/sizeof(unsigned long), sizeof(unsigned long));

    c0 = clock();
    #pragma omp parallel for
    for(unsigned dummy = 0; dummy < 1; ++dummy)
        for(i = 0; i < r; ++i) {
            p[0] = do_xor(p, n / sizeof(unsigned long)); /* "use" the result */
            printf("%4ld/%4ld\r", i, r);
            fflush(stdout);
        }
    c1 = clock();

    elapsed = (c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Bandwidth = %6.3f GB/s (Giga = 10^9)\n", (double)n * r / elapsed / 1e9);
    free(p);
}
Compiled and executed with
gcc -O3 -Wall -fopenmp single_iteration.c && time taskset -c 0 ./a.out
The wall time reported by time is 3.4s vs 7.5s.
GCC 7.3.0 (Ubuntu)
The reason for the performance difference is not actually any difference in code, but in how the memory is mapped. In the fast case you are reading from zero-pages, i.e. all virtual addresses are mapped to a single physical page filled with zeros, so nothing has to be read from actual memory. In the slow case, the allocation is backed by real, distinct physical pages, because calloc had to zero the memory by writing to it, so every read goes to memory. For details see this answer from a slightly different context.
On the other hand, it is not caused by calling omp_get_num_threads or by the pragma itself, but merely by linking to the OpenMP runtime library. You can confirm that by using -Wl,--no-as-needed -fopenmp: if you just specify -fopenmp but don't actually use it, the linker will omit the library.
Now, unfortunately, I am still missing the final puzzle piece: why does linking to OpenMP change the behavior of calloc regarding zeroed pages?
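The zero-page effect itself is easy to observe in isolation. Here is a standalone sketch (my own illustration, not from the question; it assumes Linux, where large untouched calloc allocations are backed by a shared zero page until first write):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* XOR-reduce the buffer so the read loop cannot be optimised away. */
    static double read_seconds(const unsigned long *p, unsigned long words,
                               unsigned long *sink)
    {
        clock_t c0 = clock();
        unsigned long i, x = 0;
        for (i = 0; i < words; ++i)
            x ^= p[i];
        *sink = x;
        return (clock() - c0) / (double)CLOCKS_PER_SEC;
    }

    int main(void)
    {
        unsigned long words = (1UL << 30) / sizeof(unsigned long); /* 1 GiB */
        unsigned long *p = calloc(words, sizeof *p);
        unsigned long sink;
        if (p == NULL)
            return 1;

        /* First pass: the pages were never written, so all reads hit the
           single shared zero page. */
        printf("untouched: %.2f s\n", read_seconds(p, words, &sink));

        /* Touch every page, forcing distinct physical pages to be mapped. */
        memset(p, 0, words * sizeof *p);

        /* Second pass: the same reads now have to stream from DRAM. */
        printf("touched:   %.2f s (xor=%lu)\n",
               read_seconds(p, words, &sink), sink);

        free(p);
        return 0;
    }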
I test the following simple function
void mul(double *a, double *b) {
for (int i = 0; i<N; i++) a[i] *= b[i];
}
with very large arrays so that it is memory-bandwidth bound. The test code I use is below. When I compile with -O2 it takes 1.7 seconds. When I compile with -O2 -mavx it takes only 1.0 seconds. The non-VEX-encoded scalar operations are 70% slower! Why is this?
Here is the assembly for -O2 and -O2 -mavx:
https://godbolt.org/g/w4p60f
System: i7-6700HQ @ 2.60GHz (Skylake), 32 GB mem, Ubuntu 16.10, GCC 6.3
Test code
//gcc -O2 -fopenmp test.c
//or
//gcc -O2 -mavx -fopenmp test.c
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <omp.h>

#define N 1000000
#define R 1000

void mul(double *a, double *b) {
    for (int i = 0; i < N; i++) a[i] *= b[i];
}

int main() {
    double *a = (double*)_mm_malloc(sizeof *a * N, 32);
    double *b = (double*)_mm_malloc(sizeof *b * N, 32);

    //b must be initialized to get the correct bandwidth!!!
    memset(a, 1, sizeof *a * N);
    memset(b, 1, sizeof *b * N);

    double dtime;
    const double mem = 3*sizeof(double)*N*R/1024/1024/1024;
    const double maxbw = 34.1;

    dtime = -omp_get_wtime();
    for(int i = 0; i < R; i++) mul(a, b);
    dtime += omp_get_wtime();

    printf("time %.2f s, %.1f GB/s, efficiency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);
    _mm_free(a), _mm_free(b);
}
The problem is related to a dirty upper half of an AVX register after calling omp_get_wtime(). This is a problem particularly for Skylake processors.
The first time I read about this problem was here. Since then other people have observed this problem: here and here.
Using gdb I found that omp_get_wtime() calls clock_gettime. I rewrote my code to use clock_gettime() and I see the same problem.
void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); }
void fix_sse() { }
void (*fix)();

double get_wtime() {
    struct timespec time;
    clock_gettime(CLOCK_MONOTONIC, &time);
#ifndef __AVX__
    fix();
#endif
    return time.tv_sec + 1E-9*time.tv_nsec;
}

void dispatch() {
    fix = fix_sse;
#if defined(__INTEL_COMPILER)
    if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
#else
#if defined(__GNUC__) && !defined(__clang__)
    __builtin_cpu_init();
#endif
    if (__builtin_cpu_supports("avx")) fix = fix_avx;
#endif
}
Stepping through the code with gdb, I see that the first time clock_gettime is called it calls _dl_runtime_resolve_avx(). I believe the problem is in this function, based on this comment. It appears to be called only on the first call to clock_gettime.
With GCC the problem goes away if I issue __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime; with Clang, however (using clang -O2 -fno-vectorize, since Clang vectorizes even at -O2), it only goes away if I issue it after every call to clock_gettime.
Here is the code I used to test this (with GCC 6.3 and Clang 3.8):
#include <string.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>

void fix_avx() { __asm__ __volatile__ ( "vzeroupper" : : : ); }
void fix_sse() { }
void (*fix)();

double get_wtime() {
    struct timespec time;
    clock_gettime(CLOCK_MONOTONIC, &time);
#ifndef __AVX__
    fix();
#endif
    return time.tv_sec + 1E-9*time.tv_nsec;
}

void dispatch() {
    fix = fix_sse;
#if defined(__INTEL_COMPILER)
    if (_may_i_use_cpu_feature (_FEATURE_AVX)) fix = fix_avx;
#else
#if defined(__GNUC__) && !defined(__clang__)
    __builtin_cpu_init();
#endif
    if (__builtin_cpu_supports("avx")) fix = fix_avx;
#endif
}

#define N 1000000
#define R 1000

void mul(double *a, double *b) {
    for (int i = 0; i < N; i++) a[i] *= b[i];
}

int main() {
    dispatch();
    const double mem = 3*sizeof(double)*N*R/1024/1024/1024;
    const double maxbw = 34.1;

    double *a = (double*)_mm_malloc(sizeof *a * N, 32);
    double *b = (double*)_mm_malloc(sizeof *b * N, 32);
    //b must be initialized to get the correct bandwidth!!!
    memset(a, 1, sizeof *a * N);
    memset(b, 1, sizeof *b * N);

    double dtime;
    //dtime = get_wtime(); // call once to fix GCC
    //printf("%f\n", dtime);
    //fix = fix_sse;

    dtime = -get_wtime();
    for(int i = 0; i < R; i++) mul(a, b);
    dtime += get_wtime();

    printf("time %.2f s, %.1f GB/s, efficiency %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);
    _mm_free(a), _mm_free(b);
}
If I disable lazy function call resolution with -z now (e.g. clang -O2 -fno-vectorize -z now foo.c) then Clang only needs __asm__ __volatile__ ( "vzeroupper" : : : ); after the first call to clock_gettime just like GCC.
I expected that with -z now I would only need __asm__ __volatile__ ( "vzeroupper" : : : ); right at the start of main(), but I still need it after the first call to clock_gettime.
I have a problem with OpenMP. I need to compute Pi with OpenMP and Monte Carlo. I wrote a simple program that reads the number of threads from the command line. The timing is not stable: sometimes 1 thread is faster than 16. Does anyone have an idea what I am doing wrong?
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int niter, watki;
    watki = strtol(argv[1], NULL, 0);
    niter = strtol(argv[2], NULL, 0);
    int count = 0;
    int i;
    double x, y, z;
    double pi;

    omp_set_dynamic(0);
    unsigned int myseed = omp_get_thread_num();
    double start = omp_get_wtime();
    omp_set_num_threads(watki);

    #pragma omp parallel for private(i,x,y,z) reduction(+:count)
    for (i = 0; i < niter; i++) {
        x = (double)rand_r(&myseed)/RAND_MAX;
        y = (double)rand_r(&myseed)/RAND_MAX;
        z = x*x + y*y;
        if (z <= 1) count++;
    }

    pi = (double)count / niter * 4;
    printf("# of trials= %d, threads %d , estimate of pi is %g \n", niter, watki, pi);
    double end = omp_get_wtime();
    printf("%f \n", (end - start));
}
I compile it with gcc -fopenmp pi.c -o pi
And run it with ./pi 1 10000
Thanks in advance
You're calling omp_get_thread_num outside of the parallel region, where it always returns 0.
All your rand_r calls will then access the same shared seed, which is probably the source of your problem. You should declare myseed inside the parallel region (but not inside the loop body, or the seed would be reset to the same value on every iteration), so that each thread has its own copy seeded from omp_get_thread_num:
#pragma omp parallel private(x,y,z) reduction(+:count)
{
    unsigned int myseed = omp_get_thread_num(); /* one private seed per thread */
    #pragma omp for
    for (i = 0; i < niter; i++) {
        x = (double)rand_r(&myseed)/RAND_MAX;
        y = (double)rand_r(&myseed)/RAND_MAX;
        z = x*x + y*y;
        if (z <= 1) count++;
    }
}
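Note that rand_r takes an unsigned int *, so the seed must be declared unsigned int. With the seed declared once per thread inside the parallel region, each thread advances its own independent sequence; declaring it inside the loop body would reset it to the thread number on every iteration and produce the same x and y each time.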
I am trying to parallelize this recursive function with OpenMP:
#include <stdio.h>
#include <omp.h>

void rec(int from, int to){
    int len = to - from;
    printf("%X %x %X %d\n", from, to, len, omp_get_thread_num());
    if (len > 1){
        int mid = (from + to) / 2;
        #pragma omp task
        rec(from, mid);
        #pragma omp task
        rec(mid, to);
    }
}

int main(int argc, char *argv[]){
    long len = 1024;
    #pragma omp parallel
    #pragma omp single
    rec(0, len);
    return 0;
}
But when I run it, I get a segfault:
$g++ -fopenmp -Wall -pedantic -lefence -g -O0 test.cpp && ./a.out
0 400 400 0
0 200 200 1
200 400 200 0
Segmentation fault
When I run it under valgrind it shows no errors. Without -lefence it also works.
I have tried all possible combinations of #pragma omp clauses, and it either runs single-threaded or segfaults.
What is wrong?
Thanks a lot.