Here is a simple test program:
#include <stdlib.h>
__thread int a = 0;
int main() {
#pragma omp parallel default(none)
{
a = 1;
}
return 0;
}
gcc compiles this without any problems with -fopenmp, but icc (ICC) 12.0.2 20110112 with -openmp complains with
test.c(7): error: "a" must be specified in a variable list at enclosing OpenMP parallel pragma
#pragma omp parallel default(none)
I have no idea which data-sharing attribute (shared, private, threadprivate) applies to this kind of variable. Which one is correct?
I get the expected behaviour when calling a function that accesses that thread local variable, but I have trouble accessing it from within an explicit parallel section.
Edit:
My best solution so far is to return a pointer to the variable through a function
static inline int * get_a() { return &a; }
__thread has roughly the same effect as the OpenMP threadprivate directive. As long as no C++ objects are involved, both are often implemented with the same underlying compiler mechanism and are therefore compatible, but this is not guaranteed to always work. In practice, we sometimes have to sacrifice portability to get things working within the given development constraints.
threadprivate is a directive, not a clause, so you have to do something like:
#include "header_providing_a.h"
#pragma omp threadprivate(a)
void parallel_using_a()
{
#pragma omp parallel default(none) ...
... use 'a' here
}
GCC (at least version 4.7.1) treats __thread as an implicit threadprivate declaration, so you don't have to do anything.
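A minimal sketch of the portable threadprivate route, for compilers that do not treat __thread as threadprivate. It assumes a plain file-scope variable instead of a __thread one, and the helper name run_parallel is hypothetical (it is not from the question). Because a threadprivate variable has a predetermined data-sharing attribute, default(none) should not require it in any clause:

```c
#include <assert.h>

/* File-scope variable; the directive must follow the declaration
   and precede any use of the variable. */
static int a = 0;
#pragma omp threadprivate(a)

/* Hypothetical helper: returns how many threads wrote their own copy of 'a'. */
int run_parallel(void)
{
    int writers = 0;
    /* 'a' is threadprivate, so default(none) does not ask for it in a clause;
       'writers' is covered by the reduction clause. */
    #pragma omp parallel default(none) reduction(+:writers)
    {
        a = 1;          /* each thread updates its own copy */
        writers += a;
    }
    assert(writers >= 1);
    return writers;
}
```

When built without OpenMP support the pragmas are ignored and the function simply runs on one thread.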
The following snippet is from one of the functions of my code:
static int i;
#pragma omp parallel for default(shared) private(i) schedule(static,1)
for (i=0; i<ttm_ic_last; i++)
{
static int ni, ni1, ni2;
static double ni_ratio;
static double temp_e, temp_l;
...
}
Oddly, when I comment out the line starting with #pragma, the code works properly; with it, the loop doesn't touch at least some of the intended values of i. (I'm not sure 'touch' is the correct verb here.)
I'm using a workstation with
gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
I wonder what the cause of this error can be.
(Answer by Stefan)
Don't use static variables when OpenMP threads are involved.
The problem is that a static variable has a single storage location shared by all threads, so the threads are likely to interfere with each other whenever they read and write it. Your parallel loop iterations are all looking inside the same box.
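As an illustration, here is a minimal sketch of the same loop shape with the temporaries declared as ordinary automatic variables inside the loop body, so every thread gets its own. The function smooth and its arguments are hypothetical stand-ins; the real loop body was elided in the question:

```c
#include <assert.h>

/* Hypothetical kernel: out[i] is the mean of in[i] and its right neighbour. */
void smooth(const double *in, double *out, int n)
{
    assert(n >= 2);
    #pragma omp parallel for default(shared) schedule(static,1)
    for (int i = 0; i < n - 1; i++) {
        /* Automatic variables live on each thread's stack,
           so no two iterations share them. */
        int ni1 = i, ni2 = i + 1;
        double temp = in[ni1] + in[ni2];
        out[i] = 0.5 * temp;
    }
    out[n - 1] = in[n - 1];   /* last element has no right neighbour */
}
```

The loop counter is declared in the for statement itself, which makes it private to each thread without a private(i) clause.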
From C I am calling a piece of Fortran code that then calls some other C code. In order to call the last bit of C code, I need to have two global pointers to an EarthModel struct and a SurveyGeometry struct that I have defined. I have tried to parallelize the for loop below in calcGreen.c, but have been unsuccessful with more than 1 thread (the program segfaults).
I need each thread to have its own pointer to different EarthModels and SurveyGeometrys while keeping the global definition. I tried using the omp threadprivate directive to give each thread its own struct pointer which it can allocate and free and maintain the global definition on the thread level. I have also read that the default stack is 2M for created threads, so I've tried giving the threads more memory by setting the environment variable with export OMP_STACKSIZE=512M (and higher), but the segfault persists.
shared.h
extern EarthModel *g_em;
extern SurveyGeometry *g_sg;
#pragma omp thradprivate(g_em, g_sg)
util.h
#include "shared.h"
EarthModel *g_em;
SurveyGeometry *g_sg;
calcGreen.c
#include "util.h"
...
omp_set_num_threads(2);
#pragma omp parallel for schedule(dynamic,1)
for(int ii=0; ii<nseg; ++ii){
for(int jj=0; jj<nseg; ++jj){
...
// code to allocate and initialize g_sg and g_em
g_sg = initSG();
g_em = initEM();
// code to pass through to Fortran and execute C function on g_sg and g_em
// code to free g_sg and g_em
freeSG(g_sg);
freeEM(g_em);
...
}
}
...
EDIT: Alternatively, is there a thread-safe way of getting the structs g_sg and g_em from the first C function, where they are allocated and set, to the C function that Fortran calls, without using global variables?
Not entirely sure why this worked, but spelling "threadprivate" correctly AND moving the #pragma omp threadprivate directive to util.h seems to have done the trick. The first is unsurprising, but the second isn't intuitive to me. Thank you for the help.
If Harald's comment does not already solve the problem, some suggestions:
1) If it is allowed to change the source code of calcGreen.c and if each thread does not use the pointers before they are (re-?)allocated and (re-?)initialized by calling initSG() and initEM(), I would declare them as local variables inside the inner for-loop.
2) Are the implementations of initSG(), initEM(), freeSG() and freeEM() thread-safe and reentrant?
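A minimal sketch of the per-thread global pointer pattern discussed above. The Model struct, callback_using_global and process_segments are hypothetical stand-ins for EarthModel, the C function reached through Fortran, and the loop in calcGreen.c. The point it tries to show is that the threadprivate directive sits next to the variable's declaration and before its first use:

```c
#include <stdlib.h>

typedef struct { double depth; } Model;   /* hypothetical stand-in for EarthModel */

static Model *g_model = NULL;
#pragma omp threadprivate(g_model)         /* one pointer per thread */

/* Stand-in for the C function that would be reached through Fortran. */
static double callback_using_global(void)
{
    return g_model->depth * 2.0;
}

double process_segments(int nseg)
{
    double sum = 0.0;
    #pragma omp parallel for schedule(dynamic,1) reduction(+:sum)
    for (int ii = 0; ii < nseg; ++ii) {
        g_model = malloc(sizeof *g_model);  /* allocate this thread's copy */
        if (g_model == NULL)
            continue;                       /* skip the iteration on failure */
        g_model->depth = (double)ii;
        sum += callback_using_global();
        free(g_model);                      /* freed by the thread that allocated it */
        g_model = NULL;
    }
    return sum;
}
```

Each thread allocates, uses and frees only its own pointer, so the allocation helpers still need to be thread-safe, as the second suggestion points out.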
How can I tell GCC to unroll a particular loop?
I have used the CUDA SDK where loops can be unrolled manually using #pragma unroll. Is there a similar feature for gcc? I googled a bit but could not find anything.
GCC gives you a few different ways of handling this:
Use #pragma directives, like #pragma GCC optimize ("string"...), as seen in the GCC docs. Note that the pragma applies the optimization to every function defined after it. If you wrap a single function in #pragma GCC push_options and #pragma GCC pop_options, you should be able to limit the effect to just that function, like so:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i >= 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
Annotate individual functions with GCC's attribute syntax: check the GCC function attribute docs for a more detailed discussion of the subject. An example:
//add 5 to each element of the int array.
__attribute__((optimize("unroll-loops")))
void add5(int a[20]) {
int i = 19;
for(; i >= 0; i--) {
a[i] += 5;
}
}
Note: I'm not sure how good GCC is at unrolling reverse-iterated loops (I did it to get Markdown to play nice with my code). The examples should compile fine, though.
GCC 8 has gained a new pragma that allows you to control how loop unrolling is done:
#pragma GCC unroll n
Quoting from the manual:
You can use this pragma to control how many times a loop should be
unrolled. It must be placed immediately before a for, while or do loop
or a #pragma GCC ivdep, and applies only to the loop that follows. n
is an integer constant expression specifying the unrolling factor. The
values of 0 and 1 block any unrolling of the loop.
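A minimal sketch of the pragma applied to a single loop, assuming GCC 8 or later. The function add5 mirrors the earlier examples, and the factor 4 is an arbitrary choice for illustration:

```c
/* Add 5 to each element of the int array. */
void add5(int a[20])
{
    /* Ask GCC to unroll only the loop that follows, four iterations at a time. */
    #pragma GCC unroll 4
    for (int i = 0; i < 20; i++) {
        a[i] += 5;
    }
}
```

Compilers that do not know this pragma typically ignore it, possibly with a warning, so the function behaves the same either way.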
-funroll-loops might be helpful (though it turns on loop-unrolling globally, not per-loop). I'm not sure whether there's a #pragma to do the same...
I'm just getting started experimenting adding OpenMP to some SSE code.
My first test program SOMETIMES crashes in _mm_set_ps, but works when I change the clause to if (0).
It looks so simple I must be missing something obvious.
I'm compiling with gcc -fopenmp -g -march=core2 -pthreads
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
int main()
{
#pragma omp parallel if (1)
{
#pragma omp sections
{
#pragma omp section
{
__m128 x1 = _mm_set_ps ( 1.1f, 2.1f, 3.1f, 4.1f );
}
#pragma omp section
{
__m128 x2 = _mm_set_ps ( 1.2f, 2.2f, 3.2f, 4.2f );
}
} // end omp sections
} //end omp parallel
return 0;
}
This is a bug in the OpenMP implementation. I had the same problem with GCC on Windows (MinGW), and the -mstackrealign command-line option solved it. It adds code to the prologue of every function to realign the stack to a 16-byte boundary. I didn't notice any performance penalty. You can also try adding __attribute__ ((force_align_arg_pointer)) to a function declaration, which should do the same thing for that function only. You might have to put the SSE code in a separate function that you then call from the function containing the #pragma omp, so that the stack has a chance to be realigned.
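A minimal sketch of the separate-function workaround. The names sum4, run_sections and REALIGN_STACK are hypothetical; the noinline attribute is my own assumption, added so that the compiler keeps the helper as a real call, and the scalar fallback exists only so the sketch also builds on non-x86 targets:

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Apply the attribute only where it is meaningful (GCC-compatible, x86). */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define REALIGN_STACK __attribute__((force_align_arg_pointer, noinline))
#else
#define REALIGN_STACK
#endif

/* SSE code lives in its own function so the stack can be realigned on entry. */
REALIGN_STACK
static float sum4(float a, float b, float c, float d)
{
#if defined(__SSE__)
    __m128 x = _mm_set_ps(a, b, c, d);
    float out[4];
    _mm_storeu_ps(out, x);            /* unaligned store, safe for any address */
    return out[0] + out[1] + out[2] + out[3];
#else
    return a + b + c + d;             /* scalar fallback for non-SSE targets */
#endif
}

float run_sections(void)
{
    float r1 = 0.0f, r2 = 0.0f;
    #pragma omp parallel sections
    {
        #pragma omp section
        { r1 = sum4(1.0f, 2.0f, 3.0f, 4.0f); }
        #pragma omp section
        { r2 = sum4(5.0f, 6.0f, 7.0f, 8.0f); }
    }
    return r1 + r2;
}
```

Each section writes a different shared variable, so the two sections do not race with each other.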
I stopped having the problem when I moved onto compiling for a 64-bit target (MinGW64, such as TDM GCC build).
I am playing with AVX instructions, which require 32-byte alignment, but GCC doesn't support that on Windows at all. This forced me to fix the generated assembly code with a Python script, but it works.
I smell unaligned memory access. It's the only way code like that could explode (assuming that is the only code there). For that to happen, the values would have to be kept in stack memory rather than in XMM registers, and the stack is only guaranteed to be aligned to 4 bytes. My guess is that the OpenMP code is messing up the stack alignment.