Explicit iteration counter privatisation with OMP parallel for - c

I have a few questions about parallelisation using OMP.
Say I have a program, within which there is a nested for loop. From my understanding of the directive #pragma omp parallel for, the outer iteration counter is automatically privatised. Is the same true for the inner iteration counter? This appears to be the case, as the outputs are identical whether I state it explicitly or not.
Is it necessary(/safer) to explicitly privatise iteration counters for for loops within a parallel for block?
I am compiling with GCC - I found that I had some unhelpful crosstalk between threads when using GCC 5.4.0, but not when using GCC 7.5.0. To resolve this, I added private(foo, bar) to the directive, but I am curious as to why it works without this statement for GCC 7.5.0. Does the GCC (7.5.0) automatically identify race conditions/crosstalk and privatise things it thinks should be private?
Other than allocating a few additional memory addresses, is there any significant overhead cost in privatising variables? I think likely 'yes, but (in my case) negligible'. Target audience for code will be using systems with ~10s-100s of cores
Toy example, which finds maximum values in chunks along an array:
find_max(double *inArr, double *outArr, int64_t nSamps, int64_t nCells, int64_t threads) {
double maxVal, curVal;
int64_t t, cell;
#pragma omp parallel for private(maxVal, curVal) num_threads(threads)
for (t=0; t<nSamps; t++) {
maxVal = inArr[t];
for (cell=1; cell<nCells; cell++) {
curVal = inArr[cell * nSamps + t];
if (curVal > maxVal) {
maxVal = curVal;
}
}
outArr[t] = maxVal;
}
}
I am building this as an extension module for a Python library - the call to gcc is:
gcc -pthread -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -Wall -Wstrict-prototypes -c src.c -o src.o -fopenmp -fPIC -Ofast

Related

pragma omp for simd does not generate vector instructions in GCC

Short: Does the pragma omp for simd OpenMP directive generate code that uses SIMD registers?
Longer:
As stated in the OpenMP documentation "The worksharing-loop SIMD construct specifies that the iterations of one or more associated loops will be distributed across threads that already exist [..] using SIMD instructions". From this statement, I would expect the following code (simd.c) to use XMM, YMM or ZMM registers when compiling running gcc simd.c -o simd -fopenmp but it does not.
#include <stdio.h>
#define N 100
int main() {
int x[N];
int y[N];
int z[N];
int i;
int sum;
for(i=0; i < N; i++) {
x[i] = i;
y[i] = i;
}
#pragma omp parallel
{
#pragma omp for simd
for(i=0; i < N; i++) {
z[i] = x[i] + y[i];
}
#pragma omp for simd reduction(+:sum)
for(i=0; i < N; i++) {
sum += x[i];
}
}
printf("%d %d\n",z[N/2], sum);
return 0;
}
When checking the assembler generated running gcc simd.c -S -fopenmp no SIMD register is used.
I can use SIMD registers without OpenMP using the option -O3 because according to GCC documentation
it includes the -ftree-vectorize flag.
XMM registers: gcc simd.c -o simd -O3
YMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512
ZMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512 -mprefer-vector-width=512
However, using the flags -march=skylake-avx512 -mprefer-vector-width=512 combined with -fopenmp does not generates SIMD instructions.
Therefore, I can easily vectorize my code with -O3 without the pragma omp for simd but not for the other way around.
At this point, my purpose is not to generate SIMD instructions but to understand how do OpenMP SIMD directives work in GCC and how to generate SIMD instructions only with OpenMP (without -O3).
Enable at least -O2 for -fopenmp to work, and for performance in general
gcc simd.c -S -fopenmp
GCC's default is -O0, anti-optimized for consistent debugging. It's never going to auto-vectorize with -O0 because it's pointless when every i value from the C source has to exist in memory, and so on. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
Also impossible when you have to be able to single-step source lines one at a time, and even modify i or memory contents at runtime with the debugger, and have the program keep running like you'd expect the C abstract machine would.
Building without any optimization is utter garbage for performance; it's insane to even consider if you care about performance enough to be using OpenMP. (Except of course for actual debugging.) Often the speedup from anti-optimized to optimized scalar is more than what you could gain from vectorizing that scalar code, but both can be large factors so you definitely want optimizations beyond auto-vectorization.
I can use SIMD registers without OpenMP using the option -O3 because according to GCC documentation it includes the -ftree-vectorize flag.
Right, so do that. -O3 -march=native -flto is usually your best bet for code that will run on the compile host. Also -fno-trapping-math -fno-math-errno should be safe for everything and enable some better FP function inlining, even if you don't want -ffast-math. Also preferably -fprofile-generate / -fprofile-use profile-guided optimization (PGO), to unroll hot loops and choose branchy vs. branchless appropriately, etc.
#pragma omp parallel is still effective at -O3 -fopenmp - GCC doesn't enable autoparallelization by default.
Also, #pragma omp simd will use a different vectorization style sometimes. In your case, it seems to make GCC forget that it knows the arrays are 16-byte aligned, and use movdqu loads (when AVX isn't available for an unaligned memory source operand for paddd xmm0, [rax]). Compare https://godbolt.org/z/8q8Dqm - the main._omp_fn.0: helper function that main calls doesn't assume alignment. (Although maybe it can't after division by number of threads splits up the array into ranges, if GCC doesn't bother to do vector-sized chunks?)
Use -O2 -fopenmp to get what you were expecting
OpenMP will let gcc vectorize more easily or efficiently for loops where you didn't use restrict on pointer args to functions to let it know that arrays don't overlap, or for floating point to let it pretend that FP math is associative even if you didn't use -ffast-math.
Or if you enable some optimization but not full optimization (e.g. -O2 which doesn't include -ftree-vectorize), then #pragma omp will work the way you expected.
Note that the x[i] = y[i] = i; init loop doesn't get auto-vectorized at -O2, but the #pragma loops are. And that without -fopenmp, pure scalar. Godbolt compiler explorer
The serial -O3 code will run faster for this small N because thread-startup overhead is nowhere near worth it. But for large N, parallelization could help if a single core can't saturate memory bandwidth (e.g. on a Xeon, but most dual/quad-core desktop CPUs can almost saturate mem bandwidth with one core). Or if your arrays are hot in cache on different cores.
Unfortunately(?) even GCC -O3 doesn't manage to do constant-propagation through your whole code and just print the result. Or to fuse the z[i] = x[i]+y[i] loop with the sum(x[]) loop.

Use openMP only when an argument is passed to the program

Is there a good way to use OpenMP to parallelize a for-loop, only if an -omp argument is passed to the program?
This seems not possible, since #pragma omp parallel for is a preprocessor directive and thus evaluated even before compile time and of course it is only certain if the argument is passed to the program at runtime.
At the moment I am using a very ugly solution to achieve this, which leads to an enormous duplication of code.
if(ompDefined) {
#pragma omp parallel for
for(...)
...
}
else {
for(...)
...
}
I think what you are looking for can be solved using a CPU dispatcher technique.
For benchmarking OpenMP code vs. non-OpenMP code you can create different object files from the same source code like this
//foo.c
#ifdef _OPENMP
double foo_omp() {
#else
double foo() {
#endif
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<1000000000; i++) sum += i%10;
return sum;
}
Compile like this
gcc -O3 -c foo.c
gcc -O3 -fopenmp -c foo.c -o foo_omp.o
This creates two object files foo.o and foo_omp.o. Then you can call one of these functions like this
//bar.c
#include <stdio.h>
double foo();
double foo_omp();
double (*fp)();
int main(int argc, char *argv[]) {
if(argc>1) {
fp = foo_omp;
}
else {
fp = foo;
}
double sum = fp();
printf("sum %e\n", sum);
}
Compile and link like this
gcc -O3 -fopenmp bar.c foo.o foo_omp.o
Then I time the code like this
time ./a.out -omp
time ./a.out
and the first case takes about 0.4 s and the second case about 1.2 s on my system with 4 cores/8 hardware threads.
Here is a solution which only needs a single source file
#include <stdio.h>
typedef double foo_type();
foo_type foo, foo_omp, *fp;
#ifdef _OPENMP
#define FUNCNAME foo_omp
#else
#define FUNCNAME foo
#endif
double FUNCNAME () {
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<1000000000; i++) sum += i%10;
return sum;
}
#ifdef _OPENMP
int main(int argc, char *argv[]) {
if(argc>1) {
fp = foo_omp;
}
else {
fp = foo;
}
double sum = fp();
printf("sum %e\n", sum);
}
#endif
Compile like this
gcc -O3 -c foo.c
gcc -O3 -fopenmp foo.c foo.o
You can set the number of threads at run-time by calling omp_set_num_threads:
#include <omp.h>
int main()
{
int threads = 1;
#ifdef _OPENMP
omp_set_num_threads(threads);
#endif
#pragma omp parallel for
for(...)
{
...
}
}
This isn't quite the same as disabling OpenMP, but it will stop it running calculations in parallel. I've found it's always a good idea to set this using a command line switch (you can implement this using GNU getopt or Boost.ProgramOptions). This allows you to easily run single-threaded and multi-threaded tests on the same code.
As Vladimir F pointed out in the comments, you can also set the number of threads by setting the environment variable OMP_NUM_THREADS before executing your program:
gcc -Wall -Werror -pedantic -O3 -fopenmp -o test test.c
OMP_NUM_THREADS=1
./test
unset OMP_NUM_THREADS
Finally, you can disable OpenMP at compile-time by not providing GCC with the -fopenmp option. However, you will need to put preprocessor guards around any lines in your code that require OpenMP to be enabled (see above). If you want to use some functions included in the OpenMP library without actually enabling the OpenMP pragmas you can simply link against the OpenMP library by replacing the -fopenmp option with -lgomp.
One solution would be to use the preprocessor to ignore the pragma statement if you do not pass an additional flag to the compiler.
For example in your code you might have:
#ifdef MP_ENABLED
#pragma omp parallel for
#endif
for(...)
...
and then when you compile you can pass a flag to the compiler to define the MP_ENABLED macro. In the case of GCC (and Clang) you would pass -DMP_ENABLED.
You then might compile with gcc as
gcc SOME_SOURCE.c -I SOME_INCLUDE.h -lomp -DMP_ENABLED -o SOME_OUTPUT
then when you want to disable the parallelism you can make a minor tweek to the compile command by dropping -DMP_ENABLED.
gcc SOME_SOURCE.c -I SOME_INCLUDE.h -lomp -DMP_ENABLED -o SOME_OUTPUT
This causes the macro to be undefined which leads to the preprocessor ignoring the pragma.
You could also use a similar solution using ifndef instead depending on whether you consider the parallel behavior the default or not.
Edit: As noted by some comments, inclusion of OMP lib defines some macros such as _OPENMP which you could use in place of your own user-defined macros. That looks to be a superior solution, but the difference in effort is reasonably small.

How to use OpenMP in a C project built with meson

I am quite new to meson and C, please forgive me if the answer to this question is trivial ...
I want to use OpenMP in a C project, and I am using meson as a build tool.
I want to compile the parallel for example from this tutorial.
My main.c looks very similar:
#include <omp.h>
#define N 1000
#define CHUNKSIZE 100
int main(int argc, char *argv[]) {
int i, chunk;
float a[N], b[N], c[N];
/* Some initializations */
for (i=0; i < N; i++)
a[i] = b[i] = i * 1.0;
chunk = CHUNKSIZE;
#pragma omp parallel for \
shared(a,b,c,chunk) private(i) \
schedule(static,chunk)
for (i=0; i < N; i++)
c[i] = a[i] + b[i];
return 0;
}
My short meson.build file contains this:
project('openmp_with_meson', 'c')
# add_project_arguments('-fopenmp', language: 'c')
exe = executable('some_exe', 'src/main.c') #, c_args: '-fopenmp')
I commented out the c_args keyword in the call to executable here.
Now I end up with the following scenarios:
without '-fopenmp' option, I get the warning, that the pragma is unknown and will be ignored (as I would expect): ../src/main.c:15:0: warning: ignoring pragma omp parallel [-Wunknown-pragmas] #pragma omp parallel for
with the option c_args: '-fopenmp' inserted, I do not get the above warning anymore, instead I get errors for undefined references to GOMP_parallel, omp_get_num_threads and omp_get_thread_num, and nothing gets built
when I use gcc manually with gcc -Wall -o manually_with_gcc ../src/main.c -fopenmp the program compiles and executes without any errors.
Can anyone tell me how to get the executable to compile with meson?
Meson 0.46 or later
Meson 0.46 (released Apr 23, 2018) added OpenMP support. So, if you have meson 0.46 or later,
project('openmp_with_meson', 'c')
omp = dependency('openmp')
exe = executable('some_exe', 'src/main.c',
dependencies : omp)
Should work with both GCC and Clang.
Meson 0.45 or earlier
If you happen to have older version, Debian Stretch, Ubuntu Bionic (18.04LTS), or Fedora 27, you can do the following:
You need another keyword arg link_args : '-fopenmp' for executable().
exe = executable('some_exe', 'src/main.c',
c_args: '-fopenmp',
link_args : '-fopenmp')
Meson builds C program in two phases, compiling and linking. You can pass extra arguments with c_args for compiling and link_args for linking.
The option -fopenmp enables OpenMP directives while compiling, and
the flag also arranges for automatic linking of the OpenMP runtime
library.
That is, -fopenmp is dual purpose option.
Now, the above is simple and good. Once you understand it, however, you can also compile your program with -fopenmp to activate the OpenMP directives and link the OpenMP libraries by yourself without -fopenmp to the link_args.
Here is a complete meson.build:
project('openmp_with_meson', 'c')
cc = meson.get_compiler('c')
libgomp = cc.find_library('gomp')
exe = executable('some_exe', 'src/main.c',
c_args: '-fopenmp',
dependencies : libgomp)
Meson >= 0.46 now has a builtin for this (docs):
openmp = dependency('openmp') # meson builtin

Force the compiler duplicate c inline function code which been called from inside a loop [duplicate]

How can I tell GCC to unroll a particular loop?
I have used the CUDA SDK where loops can be unrolled manually using #pragma unroll. Is there a similar feature for gcc? I googled a bit but could not find anything.
GCC gives you a few different ways of handling this:
Use #pragma directives, like #pragma GCC optimize ("string"...), as seen in the GCC docs. Note that the pragma makes the optimizations global for the remaining functions. If you used #pragma push_options and pop_options macros cleverly, you could probably define this around just one function like so:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
Annotate individual functions with GCC's attribute syntax: check the GCC function attribute docs for a more detailed dissertation on the subject. An example:
//add 5 to each element of the int array.
__attribute__((optimize("unroll-loops")))
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
Note: I'm not sure how good GCC is at unrolling reverse-iterated loops (I did it to get Markdown to play nice with my code). The examples should compile fine, though.
GCC 8 has gained a new pragma that allows you to control how loop unrolling is done:
#pragma GCC unroll n
Quoting from the manual:
You can use this pragma to control how many times a loop should be
unrolled. It must be placed immediately before a for, while or do loop
or a #pragma GCC ivdep, and applies only to the loop that follows. n
is an integer constant expression specifying the unrolling factor. The
values of 0 and 1 block any unrolling of the loop.
-funroll-loops might be helpful (though it turns on loop-unrolling globally, not per-loop). I'm not sure whether there's a #pragma to do the same...

Tell gcc to specifically unroll a loop

How can I tell GCC to unroll a particular loop?
I have used the CUDA SDK where loops can be unrolled manually using #pragma unroll. Is there a similar feature for gcc? I googled a bit but could not find anything.
GCC gives you a few different ways of handling this:
Use #pragma directives, like #pragma GCC optimize ("string"...), as seen in the GCC docs. Note that the pragma makes the optimizations global for the remaining functions. If you used #pragma push_options and pop_options macros cleverly, you could probably define this around just one function like so:
#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
//add 5 to each element of the int array.
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
#pragma GCC pop_options
Annotate individual functions with GCC's attribute syntax: check the GCC function attribute docs for a more detailed dissertation on the subject. An example:
//add 5 to each element of the int array.
__attribute__((optimize("unroll-loops")))
void add5(int a[20]) {
int i = 19;
for(; i > 0; i--) {
a[i] += 5;
}
}
Note: I'm not sure how good GCC is at unrolling reverse-iterated loops (I did it to get Markdown to play nice with my code). The examples should compile fine, though.
GCC 8 has gained a new pragma that allows you to control how loop unrolling is done:
#pragma GCC unroll n
Quoting from the manual:
You can use this pragma to control how many times a loop should be
unrolled. It must be placed immediately before a for, while or do loop
or a #pragma GCC ivdep, and applies only to the loop that follows. n
is an integer constant expression specifying the unrolling factor. The
values of 0 and 1 block any unrolling of the loop.
-funroll-loops might be helpful (though it turns on loop-unrolling globally, not per-loop). I'm not sure whether there's a #pragma to do the same...

Resources