OpenMP parallel 'for' doesn't work properly - c

The following snippet is from one of the functions of my code:
static int i;
#pragma omp parallel for default(shared) private(i) schedule(static,1)
for (i=0; i<ttm_ic_last; i++)
{
static int ni, ni1, ni2;
static double ni_ratio;
static double temp_e, temp_l;
...
}
It's odd that when I comment the line starting with #pragma it works properly, otherwise the loop doesn't touch at least some of the intended values of i. (I'm not sure if 'touch' is the correct verb here.)
I'm using a workstation with
gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)
I wonder what the cause of this error can be.

(Answer by Stefan)
Don't use static variables when OpenMP threads are involved.
The thing is; with statics, they have a shared memory space. So they will likely to interfere with each other across the threads. Your parallel loops are all looking inside the same box.

Related

Is there any difference between #pragma unroll(0) and #pragma unroll(1)?

I read the document about the loop unrolling.
It explains that if you set unrolling factor as 1, then the program will work like with #pragma nounrolling.
However, that documents does not include #pragma unroll(0) case..
Since the range of n is 0 to 255, I'm just wondering out of curiosity there is any difference between #pragma unroll(0) and #pragma unroll(1) cases.
I'm using C with icc compiler.
From the Intel documentation:
The compiler generates correct code by comparing n and the loop count.
Based on that, I would make an assumption there is no difference between #pragma unroll(0) and #pragma unroll(1) as the the code generated would be equivalent.

Force the compiler duplicate c inline function code which been called from inside a loop [duplicate]

How can I tell GCC to unroll a particular loop?
I have used the CUDA SDK where loops can be unrolled manually using #pragma unroll. Is there a similar feature for gcc? I googled a bit but could not find anything.
GCC gives you a few different ways of handling this:
Use #pragma directives, like #pragma GCC optimize ("string"...), as seen in the GCC docs. Note that the pragma makes the optimizations global for the remaining functions. If you used #pragma push_options and pop_options macros cleverly, you could probably define this around just one function like so:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
Annotate individual functions with GCC's attribute syntax: check the GCC function attribute docs for a more detailed dissertation on the subject. An example:
//add 5 to each element of the int array.
__attribute__((optimize("unroll-loops")))
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
Note: I'm not sure how good GCC is at unrolling reverse-iterated loops (I did it to get Markdown to play nice with my code). The examples should compile fine, though.
GCC 8 has gained a new pragma that allows you to control how loop unrolling is done:
#pragma GCC unroll n
Quoting from the manual:
You can use this pragma to control how many times a loop should be
unrolled. It must be placed immediately before a for, while or do loop
or a #pragma GCC ivdep, and applies only to the loop that follows. n
is an integer constant expression specifying the unrolling factor. The
values of 0 and 1 block any unrolling of the loop.
-funroll-loops might be helpful (though it turns on loop-unrolling globally, not per-loop). I'm not sure whether there's a #pragma to do the same...

OpenMP and Thread Local Storage identifier with icc

This is a simple test code:
#include <stdlib.h>
__thread int a = 0;
int main() {
#pragma omp parallel default(none)
{
a = 1;
}
return 0;
}
gcc compiles this without any problems with -fopenmp, but icc (ICC) 12.0.2 20110112 with -openmp complains with
test.c(7): error: "a" must be specified in a variable list at enclosing OpenMP parallel pragma
#pragma omp parallel default(none)
I have no clue which paradigm (i.e. shared, private, threadprivate) applies to this type of variables. Which one is the correct one to use?
I get the expected behaviour when calling a function that accesses that thread local variable, but I have trouble accessing it from within an explicit parallel section.
Edit:
My best solution so far is to return a pointer to the variable through a function
static inline int * get_a() { return &a; }
__thread is roughly analogous to the effect that the threadprivate OpenMP directive has. To a great extent (read as when no C++ objects are involved), both are often implemented using the same underlying compiler mechanism and therefore are compatible but this is not guaranteed to always work. Of course, the real world is far from ideal and we have to sometimes sacrifice portability for just having things working within the given development constraints.
threadprivate is a directive and not a clause, therefore you have to do something like:
#include "header_providing_a.h"
#pragma omp threadprivate(a)
void parallel_using_a()
{
#pragma omp parallel default(none) ...
... use 'a' here
}
GCC (at least version 4.7.1) treats __thread as implicit threadprivate declaration and you don't have to do anything.

Segmentation fault using OpenMp and SSE

I'm just getting started experimenting adding OpenMP to some SSE code.
My first test program SOMETIMES crashes in _mm_set_ps, but works when I set the if (0).
It looks so simple I must be missing something obvious.
I'm compiling with gcc -fopenmp -g -march=core2 -pthreads
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
int main()
{
#pragma omp parallel if (1)
{
#pragma omp sections
{
#pragma omp section
{
__m128 x1 = _mm_set_ps ( 1.1f, 2.1f, 3.1f, 4.1f );
}
#pragma omp section
{
__m128 x2 = _mm_set_ps ( 1.2f, 2.2f, 3.2f, 4.2f );
}
} // end omp sections
} //end omp parallel
return 0;
}
This is a bug in the openMP implementation. I was having the same problem in gcc on Windows (MinGW). -mstackrealign command line option solved my problem. This adds an instruction to the prolog of every function to realign the stack at the 16-byte boundary. I didn't notice any performance penalty. You can also try to add __attribute__ ((force_align_arg_pointer)) to a function declaration, which should do the same, but only for a specific function. You might have to put the SSE code in a separate function that you then call from the function with #pragma omp, so that the stack has a chance to be realigned.
I stopped having the problem when I moved onto compiling for a 64-bit target (MinGW64, such as TDM GCC build).
I am playing with AVX instructions which require a 32-byte alignment, but GCC doesn't support that for windows at all. This forced me to fix the produced assembly code using a python script, but it works.
I smell unaligned memory access. Its the only way code like that could explode(assuming that is the only code there). For that to happen the XMM registers wouldn't be used but rather stack memory, which is only aligned to 4 bytes, my guess is the omp code is messing up the alignment of the stack.

Tell gcc to specifically unroll a loop

How can I tell GCC to unroll a particular loop?
I have used the CUDA SDK where loops can be unrolled manually using #pragma unroll. Is there a similar feature for gcc? I googled a bit but could not find anything.
GCC gives you a few different ways of handling this:
Use #pragma directives, like #pragma GCC optimize ("string"...), as seen in the GCC docs. Note that the pragma makes the optimizations global for the remaining functions. If you used #pragma push_options and pop_options macros cleverly, you could probably define this around just one function like so:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
Annotate individual functions with GCC's attribute syntax: check the GCC function attribute docs for a more detailed dissertation on the subject. An example:
//add 5 to each element of the int array.
__attribute__((optimize("unroll-loops")))
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
Note: I'm not sure how good GCC is at unrolling reverse-iterated loops (I did it to get Markdown to play nice with my code). The examples should compile fine, though.
GCC 8 has gained a new pragma that allows you to control how loop unrolling is done:
#pragma GCC unroll n
Quoting from the manual:
You can use this pragma to control how many times a loop should be
unrolled. It must be placed immediately before a for, while or do loop
or a #pragma GCC ivdep, and applies only to the loop that follows. n
is an integer constant expression specifying the unrolling factor. The
values of 0 and 1 block any unrolling of the loop.
-funroll-loops might be helpful (though it turns on loop-unrolling globally, not per-loop). I'm not sure whether there's a #pragma to do the same...

Resources