I want to test #pragma omp parallel for and #pragma omp simd for a simple matrix addition program. When I use each of them separately, I get no error and it seems fine. But I want to test how much performance can be gained by using both of them. If I use #pragma omp parallel for before the outer loop and #pragma omp simd before the inner loop, I get no error either. The error occurs when I use both of them before the outer loop. I get the error at compile time, not runtime. ICC and GCC report an error, but Clang doesn't. It might be because Clang rejects the parallelization; in my experiments, Clang does not parallelize and runs the program with only one thread.
The program is here:
#include <stdio.h>
//#include <x86intrin.h>
#define N 512
#define M N
int __attribute__((aligned(32))) a[N][M],
    __attribute__((aligned(32))) b[N][M],
    __attribute__((aligned(32))) c_result[N][M];

int main()
{
    int i, j;

    #pragma omp parallel for
    #pragma omp simd
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            c_result[i][j] = a[i][j] + b[i][j];
        }
    }
    return 0;
}
The error for:
ICC:
IMP1.c(20): error: omp directive is not followed by a parallelizable for loop
#pragma omp parallel for
^
compilation aborted for IMP1.c (code 2)
GCC:
IMP1.c: In function ‘main’:
IMP1.c:21:10: error: for statement expected before ‘#pragma’
 #pragma omp simd
Because in my other tests #pragma omp simd on the outer loop gets better performance, I need to put it there (don't I?).
Platform: Intel Core i7 6700 HQ, Fedora 27
Tested compilers: ICC 18, GCC 7.2, Clang 5
Compiler command line:
icc -O3 -qopenmp -xHOST -no-vec
gcc -O3 -fopenmp -march=native -fno-tree-vectorize -fno-tree-slp-vectorize
clang -O3 -fopenmp=libgomp -march=native -fno-vectorize -fno-slp-vectorize
From OpenMP 4.5 Specification:
2.11.4 Parallel Loop SIMD Construct
The parallel loop SIMD construct is a shortcut for specifying a parallel
construct containing one loop SIMD construct and no other statement.
The syntax of the parallel loop SIMD construct is as follows:
#pragma omp parallel for simd
...
You can also write:
#pragma omp parallel
{
#pragma omp for simd
for ...
}
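Applied to the question's loop, a minimal self-contained sketch of the combined construct looks like this (the loop variables are declared inside the for statements so each one is private, which also avoids sharing j across threads):

#include <stdio.h>
#define N 512
#define M N

int __attribute__((aligned(32))) a[N][M],
    __attribute__((aligned(32))) b[N][M],
    __attribute__((aligned(32))) c_result[N][M];

int main()
{
    /* One combined construct on the outer loop replaces the invalid
       "#pragma omp parallel for" + "#pragma omp simd" pair. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            c_result[i][j] = a[i][j] + b[i][j];
        }
    }
    return 0;
}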
Related
Suppose I would like to run the functions in parallel.
void foo()
{
foo1(args);
foo2(args);
foo3(args);
foo4(args);
}
I want these function calls to run in parallel. How can I run them in parallel with OpenMP in C?
Assuming that the code is running serially when you enter foo(), you have a couple of different options.
Option 1: use sections
void foo()
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            foo1(args);

            #pragma omp section
            foo2(args);

            #pragma omp section
            foo3(args);

            #pragma omp section
            foo4(args);
        }
    }
}
Option 2: use tasks
void foo()
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp task
            foo1(args);

            #pragma omp task
            foo2(args);

            #pragma omp task
            foo3(args);

            #pragma omp task
            foo4(args);
        }
    }
}
Tasks are the more modern way of expressing this, and, potentially, allow you more freedom in controlling the execution.
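For example, task dependences (OpenMP 4.0+) let you say that foo3 must wait for foo1 and foo2 while foo4 stays independent, something sections cannot express. A self-contained sketch with dummy stand-ins for foo1..foo4 (the variables a and b exist only to carry the dependences):

#include <stdio.h>
#include <omp.h>

static void foo1(void) { printf("foo1 on thread %d\n", omp_get_thread_num()); }
static void foo2(void) { printf("foo2 on thread %d\n", omp_get_thread_num()); }
static void foo3(void) { printf("foo3 on thread %d\n", omp_get_thread_num()); }
static void foo4(void) { printf("foo4 on thread %d\n", omp_get_thread_num()); }

void foo()
{
    int a = 0, b = 0;   /* sentinels for the depend clauses, never read */

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        foo1();

        #pragma omp task depend(out: b)
        foo2();

        /* runs only after foo1 and foo2 have completed */
        #pragma omp task depend(in: a, b)
        foo3();

        /* no ordering constraints */
        #pragma omp task
        foo4();
    }
}

int main() { foo(); return 0; }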
Short: Does the pragma omp for simd OpenMP directive generate code that uses SIMD registers?
Longer:
As stated in the OpenMP documentation, "The worksharing-loop SIMD construct specifies that the iterations of one or more associated loops will be distributed across threads that already exist [...] using SIMD instructions". From this statement, I would expect the following code (simd.c) to use XMM, YMM or ZMM registers when compiled with gcc simd.c -o simd -fopenmp, but it does not.
#include <stdio.h>
#define N 100

int main() {
    int x[N];
    int y[N];
    int z[N];
    int i;
    int sum = 0;   /* initialized so the reduction starts from a defined value */

    for (i = 0; i < N; i++) {
        x[i] = i;
        y[i] = i;
    }

    #pragma omp parallel
    {
        #pragma omp for simd
        for (i = 0; i < N; i++) {
            z[i] = x[i] + y[i];
        }
        #pragma omp for simd reduction(+:sum)
        for (i = 0; i < N; i++) {
            sum += x[i];
        }
    }

    printf("%d %d\n", z[N/2], sum);
    return 0;
}
When checking the assembly generated by gcc simd.c -S -fopenmp, no SIMD registers are used.
I can use SIMD registers without OpenMP using the option -O3 because according to GCC documentation
it includes the -ftree-vectorize flag.
XMM registers: gcc simd.c -o simd -O3
YMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512
ZMM registers: gcc simd.c -o simd -O3 -march=skylake-avx512 -mprefer-vector-width=512
However, using the flags -march=skylake-avx512 -mprefer-vector-width=512 combined with -fopenmp does not generate SIMD instructions.
Therefore, I can easily vectorize my code with -O3 without the pragma omp for simd, but not the other way around.
At this point, my purpose is not to generate SIMD instructions but to understand how OpenMP SIMD directives work in GCC and how to generate SIMD instructions only with OpenMP (without -O3).
Enable at least -O2 for -fopenmp to work, and for performance in general
gcc simd.c -S -fopenmp
GCC's default is -O0, anti-optimized for consistent debugging. It's never going to auto-vectorize with -O0 because it's pointless when every i value from the C source has to exist in memory, and so on. Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
Also impossible when you have to be able to single-step source lines one at a time, and even modify i or memory contents at runtime with the debugger, and have the program keep running like you'd expect the C abstract machine would.
Building without any optimization is utter garbage for performance; it's insane to even consider if you care about performance enough to be using OpenMP. (Except of course for actual debugging.) Often the speedup from anti-optimized to optimized scalar is more than what you could gain from vectorizing that scalar code, but both can be large factors so you definitely want optimizations beyond auto-vectorization.
I can use SIMD registers without OpenMP using the option -O3 because according to GCC documentation it includes the -ftree-vectorize flag.
Right, so do that. -O3 -march=native -flto is usually your best bet for code that will run on the compile host. Also -fno-trapping-math -fno-math-errno should be safe for everything and enable some better FP function inlining, even if you don't want -ffast-math. Also preferably -fprofile-generate / -fprofile-use profile-guided optimization (PGO), to unroll hot loops and choose branchy vs. branchless appropriately, etc.
#pragma omp parallel is still effective at -O3 -fopenmp - GCC doesn't enable autoparallelization by default.
Also, #pragma omp simd will use a different vectorization style sometimes. In your case, it seems to make GCC forget that it knows the arrays are 16-byte aligned, and use movdqu loads (when AVX isn't available for an unaligned memory source operand for paddd xmm0, [rax]). Compare https://godbolt.org/z/8q8Dqm - the main._omp_fn.0: helper function that main calls doesn't assume alignment. (Although maybe it can't after division by number of threads splits up the array into ranges, if GCC doesn't bother to do vector-sized chunks?)
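If you want to hand that alignment information back to the compiler, the simd constructs take an aligned clause; a sketch using the question's arrays (only valid if they really are 16-byte aligned, which is not guaranteed for these locals, so this is illustrative rather than a drop-in fix):

    #pragma omp for simd aligned(x, y, z : 16)
    for (i = 0; i < N; i++) {
        z[i] = x[i] + y[i];
    }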
Use -O2 -fopenmp to get what you were expecting
OpenMP will let gcc vectorize more easily or efficiently for loops where you didn't use restrict on pointer args to functions to let it know that arrays don't overlap, or for floating point to let it pretend that FP math is associative even if you didn't use -ffast-math.
Or if you enable some optimization but not full optimization (e.g. -O2 which doesn't include -ftree-vectorize), then #pragma omp will work the way you expected.
Note that the x[i] = y[i] = i; init loop doesn't get auto-vectorized at -O2, but the #pragma loops are. And without -fopenmp, everything stays pure scalar. Godbolt compiler explorer
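For instance, with plain pointer arguments (a hypothetical add_arrays helper, not from the question), gcc -O2 can't prove the arrays don't overlap, but #pragma omp simd promises the iterations are independent, so gcc -O2 -fopenmp (or -fopenmp-simd) can vectorize the loop without restrict:

void add_arrays(float *dst, const float *x, const float *y, int n)
{
    /* the pragma asserts independence, so no overlap check is needed */
    #pragma omp simd
    for (int i = 0; i < n; i++)
        dst[i] = x[i] + y[i];
}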
The serial -O3 code will run faster for this small N because thread-startup overhead is nowhere near worth it. But for large N, parallelization could help if a single core can't saturate memory bandwidth (e.g. on a Xeon, but most dual/quad-core desktop CPUs can almost saturate mem bandwidth with one core). Or if your arrays are hot in cache on different cores.
Unfortunately(?) even GCC -O3 doesn't manage to do constant-propagation through your whole code and just print the result. Or to fuse the z[i] = x[i]+y[i] loop with the sum(x[]) loop.
My OpenMP program is like this:
#include <stdio.h>
#include <omp.h>
int main (void)
{
    int i = 10;

    #pragma omp parallel lastprivate(i)
    {
        printf("thread %d: i = %d\n", omp_get_thread_num(), i);
        i = 1000 + omp_get_thread_num();
    }

    printf("i = %d\n", i);
    return 0;
}
Compiling it with gcc produces the following error:
# gcc -fopenmp test.c
test.c: In function 'main':
test.c:8:26: error: 'lastprivate' is not valid for '#pragma omp parallel'
#pragma omp parallel lastprivate(i)
^~~~~~~~~~~
Why does OpenMP forbid using lastprivate in #pragma omp parallel?
The meaning of lastprivate, is to assign "the sequentially last iteration of the associated loops, or the lexically last section construct [...] to the original list item."
Hence, it has no meaning for a pure parallel construct. It would not be a good idea to define it as "the last thread to exit the parallel construct" - that would be a race condition.
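On a worksharing loop, where it is defined, lastprivate behaves like this (a small self-contained sketch, not from the question):

#include <stdio.h>

int main(void)
{
    int i, last = -1;

    /* Each thread gets a private "last"; the value written by the
       sequentially last iteration (i == 99) is copied back afterwards. */
    #pragma omp parallel for lastprivate(last)
    for (i = 0; i < 100; i++)
        last = i * i;

    printf("last = %d\n", last);   /* prints 9801 */
    return 0;
}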
I have a single block enclosed in a sections block like this
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main (int argc, char *argv[])
{
    int nthreads, tid;

    /* Fork a team of threads giving them their own copies of variables */
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();

        #pragma omp sections
        {
            #pragma omp section
            {
                printf("First section %d \n", tid);
            }

            #pragma omp section
            {
                #pragma omp single
                {
                    printf("Second Section block %d \n", tid);
                }
            }
        }
    } /* All threads join master thread and disband */

    printf("Outside parallel block \n");
}
When I compile this code, the compiler gives the following warning:
work-sharing region may not be closely nested inside of work-sharing, critical, ordered or master region
Why is that?
It gives you this warning because you have an OpenMP single region nested inside an OpenMP sections region without an OpenMP parallel region between them.
Such nesting is called closely nested, and a worksharing region may not be closely nested inside another worksharing region.
In C, the worksharing constructs are for, sections, and single.
For further information see the OpenMP Specification or see Intel's Documentation on Improper nesting of OpenMP* constructs.
In order to have the code compile cleanly, either remove the #pragma omp single (each section body is already executed by exactly one thread, so it adds nothing here), or enclose it in its own nested #pragma omp parallel region so that a parallel region sits between the two worksharing constructs, as sketched below.
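For example, simply dropping the single (a sketch of the question's parallel region, unchanged otherwise):

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();

        #pragma omp sections
        {
            #pragma omp section
            printf("First section %d \n", tid);

            #pragma omp section
            printf("Second Section block %d \n", tid);
        }
    }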
See Guide into OpenMP: Easy multithreading programming for C++ for more information and examples.
I'm just getting started experimenting adding OpenMP to some SSE code.
My first test program SOMETIMES crashes in _mm_set_ps, but works when I change the if (1) to if (0).
It looks so simple I must be missing something obvious.
I'm compiling with gcc -fopenmp -g -march=core2 -pthreads
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
int main()
{
    #pragma omp parallel if (1)
    {
        #pragma omp sections
        {
            #pragma omp section
            {
                __m128 x1 = _mm_set_ps ( 1.1f, 2.1f, 3.1f, 4.1f );
            }

            #pragma omp section
            {
                __m128 x2 = _mm_set_ps ( 1.2f, 2.2f, 3.2f, 4.2f );
            }
        } // end omp sections
    } // end omp parallel
    return 0;
}
This is a bug in the OpenMP implementation. I was having the same problem with gcc on Windows (MinGW). The -mstackrealign command-line option solved my problem: it adds an instruction to the prologue of every function to realign the stack at a 16-byte boundary. I didn't notice any performance penalty. You can also try adding __attribute__ ((force_align_arg_pointer)) to a function declaration, which should do the same, but only for that specific function. You might have to put the SSE code in a separate function that you then call from the function with #pragma omp, so that the stack has a chance to be realigned.
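A sketch of the attribute approach (do_sse_work is a made-up name for illustration; you would call it from inside the #pragma omp section):

#include <immintrin.h>

/* Realign the stack on entry to this function only, so __m128 locals and
   spills land on 16-byte boundaries even if the caller's stack is misaligned. */
__attribute__((force_align_arg_pointer))
static void do_sse_work(void)
{
    __m128 x = _mm_set_ps(1.1f, 2.1f, 3.1f, 4.1f);
    (void)x;   /* real code would of course use x */
}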
I stopped having the problem when I moved to compiling for a 64-bit target (MinGW64, such as the TDM GCC build).
I am playing with AVX instructions, which require 32-byte alignment, but GCC doesn't support that for Windows at all. This forced me to fix the produced assembly code using a Python script, but it works.
I smell unaligned memory access. It's the only way code like that could explode (assuming that is the only code there). For that to happen, the XMM registers wouldn't be used but rather stack memory, which is only aligned to 4 bytes; my guess is the OpenMP code is messing up the alignment of the stack.