I am trying to use OpenMP in code that does some complex algebra with FFTW operations. Each thread needs to work separately with its own work arrays, so I followed this answer ( Reusable private dynamically allocated arrays in OpenMP ) and tried to use the threadprivate functionality. The FFTW plans and workspace variables need to be used many times, in multiple parallel segments of the code.
The following is a test code. However, it gives all kinds of errors (segmentation faults, etc.) at the line p1=fftw_plan_ .... Any idea what I am doing wrong? This approach seems to work fine with plain arrays but not with the FFTW plans. Can an FFTW plan be shared by threads while each thread works on its own data segments inpd, outc1, outc2?
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
...
#include <math.h>
#include <complex.h>
#include <omp.h>
#include <fftw3.h>

int main(int argc, char *argv[])
{
    int npts2, id;
    static double *inpd;
    static fftw_complex *outc1, *outc2;
    static fftw_plan p1, p2;
    static int npts2stat;
    #pragma omp threadprivate(inpd, outc1, outc2, p1, p2, npts2stat)

    npts2 = 1000;

    #pragma omp parallel private(id) shared(npts2)
    {
        id = omp_get_thread_num();
        printf("Thread %d: Allocating threadprivate memory \n", id);
        npts2stat = npts2;
        printf("step1 \n");
        inpd = malloc(sizeof(double) * npts2stat);
        outc1 = fftw_malloc(sizeof(fftw_complex) * npts2stat);
        outc2 = fftw_malloc(sizeof(fftw_complex) * npts2stat);
        printf("step2 \n");
        // CODE COMPILES FINE BUT STOPS HERE WITH ERROR
        p1 = fftw_plan_dft_r2c_1d(npts2stat, inpd, outc1, FFTW_ESTIMATE);
        p2 = fftw_plan_dft_1d(npts2stat, outc1, outc2, FFTW_BACKWARD, FFTW_ESTIMATE);
        printf("step3 \n");
    }

    // multiple omp parallel segments with different threads doing some calculations on complex numbers using FFTW

    #pragma omp parallel private(id)
    {
        id = omp_get_thread_num();
        printf("Thread %d: Deallocating threadprivate memory \n", id);
        free(inpd);
        fftw_free(outc1);
        fftw_free(outc2);
        fftw_destroy_plan(p1);
        fftw_destroy_plan(p2);
    }
}
EDIT1: I seem to have solved the segmentation fault issue as shown below. However, if you have a better way of doing this, that would be helpful. For example, I don't know whether a single plan for all threads would be sufficient, since each plan acts on data (inpd, outc1, outc2) private to its thread.
The section on Thread Safety in the FFTW manual makes it pretty clear that the FFTW planner may be called by only one thread at a time. Serializing the planner calls seems to have solved the segmentation fault issue:
#pragma omp critical
{
    p1 = fftw_plan_dft_r2c_1d(npts2stat, inpd, outc1, FFTW_ESTIMATE);
    p2 = fftw_plan_dft_1d(npts2stat, outc1, outc2, FFTW_BACKWARD, FFTW_ESTIMATE);
}
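As for sharing a single plan across threads: the Thread Safety section of the FFTW manual also says the execute functions are the only thread-safe routines in FFTW, and the "new-array execute" functions (e.g. fftw_execute_dft_r2c) let one plan run on arrays other than the ones it was planned with, provided they have the same alignment (fftw_malloc guarantees that). A minimal sketch of that approach, assuming FFTW 3 and OpenMP:

#include <omp.h>
#include <fftw3.h>

int main(void)
{
    int npts = 1000;
    /* Plan ONCE, serially, on a master pair of arrays. */
    double *in0 = fftw_malloc(sizeof(double) * npts);
    fftw_complex *out0 = fftw_malloc(sizeof(fftw_complex) * (npts / 2 + 1));
    fftw_plan p = fftw_plan_dft_r2c_1d(npts, in0, out0, FFTW_ESTIMATE);

    #pragma omp parallel
    {
        /* Per-thread arrays; fftw_malloc gives the same alignment
           as the arrays the plan was created with. */
        double *in = fftw_malloc(sizeof(double) * npts);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (npts / 2 + 1));
        for (int i = 0; i < npts; i++)
            in[i] = (double)i;

        /* New-array execute: thread-safe, one plan, different data. */
        fftw_execute_dft_r2c(p, in, out);

        fftw_free(in);
        fftw_free(out);
    }

    fftw_destroy_plan(p);   /* destroy once, serially */
    fftw_free(in0);
    fftw_free(out0);
    return 0;
}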
Related
The following snippet is from one of the functions of my code:
static int i;
#pragma omp parallel for default(shared) private(i) schedule(static,1)
for (i = 0; i < ttm_ic_last; i++)
{
    static int ni, ni1, ni2;
    static double ni_ratio;
    static double temp_e, temp_l;
    ...
}
Oddly, when I comment out the line starting with #pragma, it works properly; otherwise the loop doesn't touch at least some of the intended values of i. (I'm not sure if 'touch' is the correct verb here.)
I'm using a workstation with
gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
I wonder what the cause of this error might be.
(Answer by Stefan)
Don't use static variables when OpenMP threads are involved.
The thing is, statics live in a shared memory space: there is exactly one copy of each, so the threads are likely to interfere with each other. All your parallel loop iterations are looking inside the same box.
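A minimal sketch of the fix, reusing the names from the snippet in the question: make the per-iteration work variables ordinary automatic locals, so each thread keeps its own copies on its own stack.

int i;  /* ordinary automatic variable, not static */
#pragma omp parallel for default(shared) private(i) schedule(static,1)
for (i = 0; i < ttm_ic_last; i++)
{
    int ni, ni1, ni2;        /* automatic locals: one copy per thread */
    double ni_ratio;
    double temp_e, temp_l;
    /* ... loop body as before ... */
}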
From C I am calling a piece of Fortran code that then calls some other C code. In order to call that last bit of C code, I need two global pointers to an EarthModel struct and a SurveyGeometry struct that I have defined. I have tried to parallelize the for loop below in calcGreen.c, but have been unsuccessful with more than one thread (the program segfaults).
I need each thread to have its own pointers to different EarthModel and SurveyGeometry structs while keeping the global definitions. I tried using the omp threadprivate directive to give each thread its own struct pointers, which it can allocate and free while maintaining the global definition at the thread level. I have also read that the default stack for created threads is 2M, so I've tried giving the threads more memory by setting the environment variable with export OMP_STACKSIZE=512M (and higher), but the segfault persists.
shared.h
extern EarthModel *g_em;
extern SurveyGeometry *g_sg;
#pragma omp thradprivate(g_em, g_sg)
util.h
#include "shared.h"
EarthModel *g_em;
SurveyGeometry *g_sg;
calcGreen.c
#include "util.h"
...
omp_set_num_threads(2);
#pragma omp parallel for schedule(dynamic,1)
for (int ii = 0; ii < nseg; ++ii) {
    for (int jj = 0; jj < nseg; ++jj) {
        ...
        // code to allocate and initialize g_sg and g_em
        g_sg = initSG();
        g_em = initEM();
        // code to pass through to Fortran and execute C function on g_sg and g_em
        // code to free g_sg and g_em
        freeSG(g_sg);
        freeEM(g_em);
        ...
    }
}
...
EDIT: Alternatively, is there a way of getting the structs g_sg and g_em from the first C function, where they are allocated, to the C function that Fortran calls, in a thread-safe way without using global variables?
Not entirely sure why this worked, but spelling "threadprivate" correctly AND moving the #pragma omp threadprivate directive to util.h seems to have done the trick. The first is unsurprising, but the second isn't intuitive to me. Thank you for the help.
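A rough reconstruction of the layout the self-answer describes (spelling corrected, directive moved next to the definitions in util.h):

shared.h

extern EarthModel *g_em;
extern SurveyGeometry *g_sg;

util.h

#include "shared.h"
EarthModel *g_em;
SurveyGeometry *g_sg;
#pragma omp threadprivate(g_em, g_sg)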
If Harald's comment does not already solve the problem, some suggestions:
1) If you are allowed to change the source code of calcGreen.c, and if each thread does not use the pointers before they are (re-?)allocated and (re-?)initialized by calling initSG() and initEM(), I would declare them as local variables inside the inner for loop (see the sketch below).
2) Are the implementations of initSG(), initEM(), freeSG() and freeEM() thread-safe and reentrant?
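A minimal sketch of suggestion 1), assuming the Fortran/C call chain can take the pointers as arguments instead of reading the globals:

#pragma omp parallel for schedule(dynamic,1)
for (int ii = 0; ii < nseg; ++ii) {
    for (int jj = 0; jj < nseg; ++jj) {
        /* Local pointers: each iteration, and thus each thread,
           works with its own structs. */
        SurveyGeometry *sg = initSG();
        EarthModel *em = initEM();
        /* ... pass sg and em down the Fortran/C call chain ... */
        freeSG(sg);
        freeEM(em);
    }
}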
This is a simple test code:
#include <stdlib.h>
__thread int a = 0;
int main() {
    #pragma omp parallel default(none)
    {
        a = 1;
    }
    return 0;
}
gcc compiles this without any problems with -fopenmp, but icc (ICC) 12.0.2 20110112 with -openmp complains with
test.c(7): error: "a" must be specified in a variable list at enclosing OpenMP parallel pragma
#pragma omp parallel default(none)
I have no clue which paradigm (i.e. shared, private, threadprivate) applies to this type of variable. Which one is the correct one to use?
I get the expected behaviour when calling a function that accesses that thread local variable, but I have trouble accessing it from within an explicit parallel section.
Edit:
My best solution so far is to return a pointer to the variable through a function:
static inline int * get_a() { return &a; }
__thread is roughly analogous in effect to the threadprivate OpenMP directive. To a great extent (read: when no C++ objects are involved), both are often implemented with the same underlying compiler mechanism and are therefore compatible, but this is not guaranteed to always work. Of course, the real world is far from ideal, and we sometimes have to sacrifice portability just to have things working within the given development constraints.
threadprivate is a directive and not a clause, therefore you have to do something like:
#include "header_providing_a.h"
#pragma omp threadprivate(a)
void parallel_using_a()
{
#pragma omp parallel default(none) ...
... use 'a' here
}
GCC (at least version 4.7.1) treats __thread as an implicit threadprivate declaration, so you don't have to do anything.
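For reference, a minimal compilable sketch of the pattern, using a plain file-scope variable (per the note above, a __thread variable behaves the same way under GCC). Threadprivate variables have predetermined data-sharing, so default(none) does not require listing them:

#include <stdio.h>
#include <omp.h>

int a = 0;
#pragma omp threadprivate(a)   /* a directive at file scope, not a clause */

int main()
{
    #pragma omp parallel default(none)
    {
        a = omp_get_thread_num();   /* each thread writes its own copy */
        printf("thread %d: a = %d\n", omp_get_thread_num(), a);
    }
    return 0;
}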
I have a single block enclosed in a sections block, like this:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main (int argc, char *argv[])
{
    int nthreads, tid;

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();   /* thread id used in the prints below */
        #pragma omp sections
        {
            #pragma omp section
            {
                printf("First section %d \n", tid);
            }
            #pragma omp section
            {
                #pragma omp single
                {
                    printf("Second Section block %d \n", tid);
                }
            }
        }
    } /* All threads join master thread and disband */

    printf("Outside parallel block \n");
}
When I compile this code, the compiler gives the following warning:
work-sharing region may not be closely nested inside of work-sharing, critical, ordered or master region
Why is that?
It gives you this warning because you have an OpenMP single region nested directly inside an OpenMP sections region, without an OpenMP parallel region between them. That is what "closely nested" means here.
In C, the worksharing constructs are for, sections, and single.
For further information see the OpenMP Specification or see Intel's Documentation on Improper nesting of OpenMP* constructs.
To get the code to compile cleanly, try replacing your #pragma omp sections with #pragma omp parallel sections, or enclosing the offending construct in its own #pragma omp parallel so that a parallel region sits between the two work-sharing constructs.
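For instance, a minimal sketch of the second option, interposing a nested parallel region between the sections and the single:

#include <stdio.h>

int main()
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                printf("First section\n");
            }
            #pragma omp section
            {
                /* The nested parallel region sits between the two
                   work-sharing constructs, so the single is no longer
                   closely nested inside the sections. */
                #pragma omp parallel
                {
                    #pragma omp single
                    printf("Second section block\n");
                }
            }
        }
    }
    printf("Outside parallel block\n");
    return 0;
}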
See Guide into OpenMP: Easy multithreading programming for C++ for more information and examples.
I'm just getting started experimenting with adding OpenMP to some SSE code.
My first test program SOMETIMES crashes in _mm_set_ps, but works when I set the if (0).
It looks so simple that I must be missing something obvious.
I'm compiling with gcc -fopenmp -g -march=core2 -pthreads
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

int main()
{
    #pragma omp parallel if (1)
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                __m128 x1 = _mm_set_ps(1.1f, 2.1f, 3.1f, 4.1f);
            }
            #pragma omp section
            {
                __m128 x2 = _mm_set_ps(1.2f, 2.2f, 3.2f, 4.2f);
            }
        } // end omp sections
    } // end omp parallel
    return 0;
}
This is a bug in the OpenMP implementation. I was having the same problem with gcc on Windows (MinGW). The -mstackrealign command line option solved my problem: it adds an instruction to the prologue of every function to realign the stack at the 16-byte boundary. I didn't notice any performance penalty. You can also try adding __attribute__ ((force_align_arg_pointer)) to a function declaration, which should do the same, but only for that specific function. You might have to put the SSE code in a separate function that you then call from the function with the #pragma omp, so that the stack has a chance to be realigned.
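A minimal sketch of that per-function workaround, with the SSE work moved into its own function (the function name here is illustrative):

#include <stdio.h>
#include <immintrin.h>

/* Realign the stack to a 16-byte boundary on entry, so __m128
   locals can be spilled safely even if the caller's stack is
   misaligned. */
__attribute__ ((force_align_arg_pointer))
static void do_sse_work(float base)
{
    __m128 x = _mm_set_ps(base + 0.1f, base + 0.2f, base + 0.3f, base + 0.4f);
    float out[4];
    _mm_storeu_ps(out, x);   /* unaligned store: safe regardless of alignment */
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
}

int main()
{
    #pragma omp parallel sections
    {
        #pragma omp section
        do_sse_work(1.0f);
        #pragma omp section
        do_sse_work(2.0f);
    }
    return 0;
}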
I stopped having the problem when I moved to compiling for a 64-bit target (MinGW64, such as the TDM GCC build).
I am playing with AVX instructions, which require 32-byte alignment, but GCC doesn't support that on Windows at all. This forced me to fix the produced assembly code with a Python script, but it works.
I smell unaligned memory access. It's the only way code like that could explode (assuming that is the only code there). For that to happen, the XMM registers wouldn't be used, but rather stack memory, which is only aligned to 4 bytes; my guess is that the OpenMP code is messing up the alignment of the stack.