OpenMP strange behaviour - c

Hello I have the following code, which I compile with gcc (>4.2) with -fopenmp flag:
int main(void)
{
#pragma omp parallel for
int i;
for(i=0;i<4;i++) while(1);
return 0;
}
I get a SIGSEGV on OSX Lion (ver 1.7.3, llvm-gcc 4.2.1) and CentOS 6.2 . What am I doing wrong here? Thanks

Not sure if this is relevant to the compiler version and configuration but while(true){} terminates
More precisely, if you write a loop which
makes no calls to library I/O functions, and
does not access or modify volatile objects, and
performs no synchronization operations (1.10) or atomic operations (Clause 29)
and does not terminate, you have undefined behaviour.
This may end up not applying to your situation, but as C++11 becomes more established, watch out.

Very interesting.
I changed the code a little
so
int main(void)
{
int i;
#pragma omp parallel
{
while(1);
}
return 0;
}
and so
inline void func() {
while (1) ;
}
int main(void)
{
int i;
#pragma omp parallel for
for(i=0;i<8;i++) {
func();
}
return 0;
}
And they both worked OK.

There was a bug in the gcc regarding this issue, I reported it and they will provide a fix. Here is the link: GCC bug

Related

Use openMP only when an argument is passed to the program

Is there a good way to use OpenMP to parallelize a for-loop, only if an -omp argument is passed to the program?
This seems not possible, since #pragma omp parallel for is a preprocessor directive and thus evaluated even before compile time and of course it is only certain if the argument is passed to the program at runtime.
At the moment I am using a very ugly solution to achieve this, which leads to an enormous duplication of code.
if(ompDefined) {
#pragma omp parallel for
for(...)
...
}
else {
for(...)
...
}
I think what you are looking for can be solved using a CPU dispatcher technique.
For benchmarking OpenMP code vs. non-OpenMP code you can create different object files from the same source code like this
//foo.c
#ifdef _OPENMP
double foo_omp() {
#else
double foo() {
#endif
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<1000000000; i++) sum += i%10;
return sum;
}
Compile like this
gcc -O3 -c foo.c
gcc -O3 -fopenmp -c foo.c -o foo_omp.o
This creates two object files foo.o and foo_omp.o. Then you can call one of these functions like this
//bar.c
#include <stdio.h>
double foo();
double foo_omp();
double (*fp)();
int main(int argc, char *argv[]) {
if(argc>1) {
fp = foo_omp;
}
else {
fp = foo;
}
double sum = fp();
printf("sum %e\n", sum);
}
Compile and link like this
gcc -O3 -fopenmp bar.c foo.o foo_omp.o
Then I time the code like this
time ./a.out -omp
time ./a.out
and the first case takes about 0.4 s and the second case about 1.2 s on my system with 4 cores/8 hardware threads.
Here is a solution which only needs a single source file
#include <stdio.h>
typedef double foo_type();
foo_type foo, foo_omp, *fp;
#ifdef _OPENMP
#define FUNCNAME foo_omp
#else
#define FUNCNAME foo
#endif
double FUNCNAME () {
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<1000000000; i++) sum += i%10;
return sum;
}
#ifdef _OPENMP
int main(int argc, char *argv[]) {
if(argc>1) {
fp = foo_omp;
}
else {
fp = foo;
}
double sum = fp();
printf("sum %e\n", sum);
}
#endif
Compile like this
gcc -O3 -c foo.c
gcc -O3 -fopenmp foo.c foo.o
You can set the number of threads at run-time by calling omp_set_num_threads:
#include <omp.h>
int main()
{
int threads = 1;
#ifdef _OPENMP
omp_set_num_threads(threads);
#endif
#pragma omp parallel for
for(...)
{
...
}
}
This isn't quite the same as disabling OpenMP, but it will stop it running calculations in parallel. I've found it's always a good idea to set this using a command line switch (you can implement this using GNU getopt or Boost.ProgramOptions). This allows you to easily run single-threaded and multi-threaded tests on the same code.
As Vladimir F pointed out in the comments, you can also set the number of threads by setting the environment variable OMP_NUM_THREADS before executing your program:
gcc -Wall -Werror -pedantic -O3 -fopenmp -o test test.c
OMP_NUM_THREADS=1
./test
unset OMP_NUM_THREADS
Finally, you can disable OpenMP at compile-time by not providing GCC with the -fopenmp option. However, you will need to put preprocessor guards around any lines in your code that require OpenMP to be enabled (see above). If you want to use some functions included in the OpenMP library without actually enabling the OpenMP pragmas you can simply link against the OpenMP library by replacing the -fopenmp option with -lgomp.
One solution would be to use the preprocessor to ignore the pragma statement if you do not pass an additional flag to the compiler.
For example in your code you might have:
#ifdef MP_ENABLED
#pragma omp parallel for
#endif
for(...)
...
and then when you compile you can pass a flag to the compiler to define the MP_ENABLED macro. In the case of GCC (and Clang) you would pass -DMP_ENABLED.
You then might compile with gcc as
gcc SOME_SOURCE.c -I SOME_INCLUDE.h -lomp -DMP_ENABLED -o SOME_OUTPUT
then when you want to disable the parallelism you can make a minor tweek to the compile command by dropping -DMP_ENABLED.
gcc SOME_SOURCE.c -I SOME_INCLUDE.h -lomp -DMP_ENABLED -o SOME_OUTPUT
This causes the macro to be undefined which leads to the preprocessor ignoring the pragma.
You could also use a similar solution using ifndef instead depending on whether you consider the parallel behavior the default or not.
Edit: As noted by some comments, inclusion of OMP lib defines some macros such as _OPENMP which you could use in place of your own user-defined macros. That looks to be a superior solution, but the difference in effort is reasonably small.

OpenMP and Thread Local Storage identifier with icc

This is a simple test code:
#include <stdlib.h>
__thread int a = 0;
int main() {
#pragma omp parallel default(none)
{
a = 1;
}
return 0;
}
gcc compiles this without any problems with -fopenmp, but icc (ICC) 12.0.2 20110112 with -openmp complains with
test.c(7): error: "a" must be specified in a variable list at enclosing OpenMP parallel pragma
#pragma omp parallel default(none)
I have no clue which paradigm (i.e. shared, private, threadprivate) applies to this type of variables. Which one is the correct one to use?
I get the expected behaviour when calling a function that accesses that thread local variable, but I have trouble accessing it from within an explicit parallel section.
Edit:
My best solution so far is to return a pointer to the variable through a function
static inline int * get_a() { return &a; }
__thread is roughly analogous to the effect that the threadprivate OpenMP directive has. To a great extent (read as when no C++ objects are involved), both are often implemented using the same underlying compiler mechanism and therefore are compatible but this is not guaranteed to always work. Of course, the real world is far from ideal and we have to sometimes sacrifice portability for just having things working within the given development constraints.
threadprivate is a directive and not a clause, therefore you have to do something like:
#include "header_providing_a.h"
#pragma omp threadprivate(a)
void parallel_using_a()
{
#pragma omp parallel default(none) ...
... use 'a' here
}
GCC (at least version 4.7.1) treats __thread as implicit threadprivate declaration and you don't have to do anything.

How to do an atomic increment and fetch in C?

I'm looking for a way to atomically increment a short, and then return that value. I need to do this both in kernel mode and in user mode, so it's in C, under Linux, on Intel 32bit architecture. Unfortunately, due to speed requirements, a mutex lock isn't going to be a good option.
Is there any other way to do this? At this point, it seems like the only option available is to inline some assembly. If that's the case, could someone point me towards the appropriate instructions?
GCC __atomic_* built-ins
As of GCC 4.8, __sync built-ins have been deprecated in favor of the __atomic built-ins: https://gcc.gnu.org/onlinedocs/gcc-4.8.2/gcc/_005f_005fatomic-Builtins.html
They implement the C++ memory model, and std::atomic uses them internally.
The following POSIX threads example fails consistently with ++ on x86-64, and always works with _atomic_fetch_add.
main.c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
enum CONSTANTS {
NUM_THREADS = 1000,
NUM_ITERS = 1000
};
int global = 0;
void* main_thread(void *arg) {
int i;
for (i = 0; i < NUM_ITERS; ++i) {
__atomic_fetch_add(&global, 1, __ATOMIC_SEQ_CST);
/* This fails consistently. */
/*global++*/;
}
return NULL;
}
int main(void) {
int i;
pthread_t threads[NUM_THREADS];
for (i = 0; i < NUM_THREADS; ++i)
pthread_create(&threads[i], NULL, main_thread, NULL);
for (i = 0; i < NUM_THREADS; ++i)
pthread_join(threads[i], NULL);
assert(global == NUM_THREADS * NUM_ITERS);
return EXIT_SUCCESS;
}
Compile and run:
gcc -std=c99 -Wall -Wextra -pedantic -o main.out ./main.c -pthread
./main.out
Disassembly analysis at: How do I start threads in plain C?
Tested in Ubuntu 18.10, GCC 8.2.0, glibc 2.28.
C11 _Atomic
In 5.1, the above code works with:
_Atomic int global = 0;
global++;
And C11 threads.h was added in glibc 2.28, which allows you to create threads in pure ANSI C without POSIX, minimal runnable example: How do I start threads in plain C?
GCC supports atomic operations:
gcc atomics

C Function alignment in GCC

I am trying to byte-align a function to 16-byte boundary using the 'aligned(16)' attribute. I did the following: void __attribute__((aligned(16))) function() { }
(Source: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html)
But when I compile (gcc foo.c ; no makefiles or linker scripts used), I get the following error:
FOO.c:99: error: alignment may not be specified for 'function'
I tried aligning to 4,8,32, etc as well but the error remains the same.
I need this to align an Interrupt Service Routine for a powerpc-based processor. What is the correct way of doing so ?
Why don't you just pass the -falign-functions=16 to gcc when compiling?
Adapting from my answer on this GCC question, you might try using #pragma directives, like so:
#pragma GCC push_options
#pragma GCC optimize ("align-functions=16")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
The #pragma push_options and pop_options macros are used to control the scope of the optimize pragma's effect. More details about these macros can be found in the GCC docs.
Alternately, if you prefer GCC's attribute syntax, you should be able to do something like:
//add 5 to each element of the int array.
__attribute__((optimize("align-functions=16")))
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
You are probably using an older version of gcc that does not support that attribute. The documentation link you provided is for the "current development" of gcc. Looking through the various releases, the attribute only appears in the documentation for gcc 4.3 and beyond.

Operations from atomic.h seem to be non-atomic

The following code produces random values for both n and v. It's not surprising that n is random without being properly protected. But it is supposed that v should finally be 0. Is there anything wrong in my code? Or could anyone explain this for me? Thanks.
I'm working on a 4-core server of x86 architecture. The uname is as follows.
Linux 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:00:54 EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
#include <stdio.h>
#include <pthread.h>
#include <asm-x86_64/atomic.h>
int n = 0;
atomic_t v;
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
#define LOOP 10000
void* foo(void *p)
{
int i = 0;
for(i = 0; i < LOOP; i++) {
// pthread_mutex_lock(&mutex);
++n;
--n;
atomic_inc(&v);
atomic_dec(&v);
// pthread_mutex_unlock(&mutex);
}
return NULL;
}
#define COUNT 50
int main(int argc, char **argv)
{
int i;
pthread_t pids[COUNT];
pthread_attr_t attr;
pthread_attr_init(&attr);
atomic_set(&v, 0);
for(i = 0; i < COUNT; i++) {
pthread_create(&pids[i], &attr, foo, NULL);
}
for(i = 0; i < COUNT; i++) {
pthread_join(pids[i], NULL);
}
printf("%d\n", n);
printf("%d\n", v);
return 0;
}
You should use gcc built-ins instead (see. this) This works fine, and also works with icc.
int a;
__sync_fetch_and_add(&a, 1); // atomic a++
Note that you should be aware of the cache consistency issues when you modify variables without locking.
This old post implies that
It's not obvious that you're supposed to include this kernel header in userspace programs
It's been known to fail to provide atomicity for userspace programs.
So ... Perhaps that's the reason for the problems you're seeing?
Can we get a look at the assembler output of the code (gcc -E, I think). Even thought the uname indicates it's SMP-aware, that doesn't necessarily mean it was compiled with CONFIG_SMP.
Without that, the assembler code output does not have the lock prefix and you can find your cores interfering with one another.
But I would be using the pthread functions anyway since they're portable across more platforms.
Linux kernel atomic.h is not usable from userland and never was. On x86, some of it might work, because x86 is rather synchronization-friendly architecture, but on some platforms it relies heavily on being able to do privileged operations (older arm) or at least on being able to disable preemption (older arm and sparc at least), which is not the case in userland!

Resources