Using OpenMP stops GCC auto vectorising - c

I have been working on making my code able to be auto vectorised by GCC, however, when I include the -fopenmp flag it seems to stop all attempts at auto vectorisation. I am using -ftree-vectorize -ftree-vectorizer-verbose=5 to vectorise and monitor it.
If I do not include the flag, it starts to give me a lot of information about each loop, whether it is vectorised and, if not, why not. The build then stops when I try to use the omp_get_wtime() function, since it can't be linked. Once the flag is included, it simply lists every function and tells me it vectorised 0 loops in it.
I've read a few other places where the issue has been mentioned, but they don't really come to any solutions: http://software.intel.com/en-us/forums/topic/295858 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032. Does OpenMP have its own way of handling vectorisation? Do I need to explicitly tell it to?

There is a shortcoming in the GCC vectoriser which appears to have been resolved in recent GCC versions. In my test case GCC 4.7.2 successfully vectorises the following simple loop:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
a[i] = b[i] + c[i] * d;
At the same time, GCC 4.6.1 does not, and it complains that the loop contains function calls or data references that cannot be analysed. The bug in the vectoriser is triggered by the way parallel for loops are implemented by GCC. When the OpenMP constructs are processed and expanded, the simple loop code is transformed into something akin to this:
struct omp_fn_0_s
{
int N;
double *a;
double *b;
double *c;
double d;
};
void omp_fn_0(struct omp_fn_0_s *data)
{
int start, end;
int nthreads = omp_get_num_threads();
int threadid = omp_get_thread_num();
// This is just to illustrate the case - GCC uses a bit different formulas
start = (data->N * threadid) / nthreads;
end = (data->N * (threadid+1)) / nthreads;
for (int i = start; i < end; i++)
data->a[i] = data->b[i] + data->c[i] * data->d;
}
...
struct omp_fn_0_s omp_data_o;
omp_data_o.N = N;
omp_data_o.a = a;
omp_data_o.b = b;
omp_data_o.c = c;
omp_data_o.d = d;
GOMP_parallel_start(omp_fn_0, &omp_data_o, 0);
omp_fn_0(&omp_data_o);
GOMP_parallel_end();
N = omp_data_o.N;
a = omp_data_o.a;
b = omp_data_o.b;
c = omp_data_o.c;
d = omp_data_o.d;
The vectoriser in GCC before 4.7 fails to vectorise that loop. This is NOT an OpenMP-specific problem. One can easily reproduce it with no OpenMP code at all. To confirm this I wrote the following simple test:
struct fun_s
{
double *restrict a;
double *restrict b;
double *restrict c;
double d;
int n;
};
void fun1(double *restrict a,
double *restrict b,
double *restrict c,
double d,
int n)
{
int i;
for (i = 0; i < n; i++)
a[i] = b[i] + c[i] * d;
}
void fun2(struct fun_s *par)
{
int i;
for (i = 0; i < par->n; i++)
par->a[i] = par->b[i] + par->c[i] * par->d;
}
One would expect that both codes (notice - no OpenMP here!) should vectorise equally well because of the restrict keywords used to specify that no aliasing can happen. Unfortunately this is not the case with GCC < 4.7 - it successfully vectorises the loop in fun1 but fails to vectorise that in fun2 citing the same reason as when it compiles the OpenMP code.
The reason for this is that the vectoriser is unable to prove that par->d does not lie within the memory that par->a, par->b, and par->c point to. This is not always the case with fun1, where two cases are possible:
d is passed as a value argument in a register;
d is passed as a value argument on the stack.
On x64 systems the System V ABI mandates that the first several floating-point arguments get passed in the XMM registers (YMM on AVX-enabled CPUs). That's how d gets passed in this case and hence no pointer can ever point to it - the loop gets vectorised. On x86 systems the ABI mandates that arguments are passed on the stack, hence d might be aliased by any of the three pointers. Indeed, GCC refuses to vectorise the loop in fun1 if instructed to generate 32-bit x86 code with the -m32 option.
GCC 4.7 gets around this by inserting run-time checks which ensure that neither d nor par->d get aliased.
Getting rid of d removes the unprovable non-aliasing and the following OpenMP code gets vectorised by GCC 4.6.1:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
a[i] = b[i] + c[i];
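For hand-written code shaped like fun2, one possible workaround is to read every struct member into a local variable before the loop, so that the loop body no longer goes through par at all. This is an assumption on my part rather than something tested above, and fun2_hoisted is a hypothetical name. A minimal sketch:

```c
struct fun_s
{
    double *restrict a;
    double *restrict b;
    double *restrict c;
    double d;
    int n;
};

/* Hypothetical variant of fun2: each member is read once, before the loop,
   so the loop body only touches locals and restrict-qualified pointers. */
void fun2_hoisted(const struct fun_s *par)
{
    double *restrict a = par->a;
    const double *restrict b = par->b;
    const double *restrict c = par->c;
    const double d = par->d;   /* local copy: stores through a cannot change it */
    const int n = par->n;

    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d;
}
```

Whether a particular GCC version then vectorises the loop still has to be confirmed from the vectoriser report. Note that this does not help with the function GCC outlines for an OpenMP region, since that code is generated by the compiler.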

I'll try to briefly answer your question.
Does OpenMP have its own way of handling vectorisation?
Yes... but only starting from the upcoming OpenMP 4.0. The link posted above provides a good insight on this construct. The current OpenMP 3.1, on the other hand, is not "aware" of the SIMD concept. What happens therefore in practice (or, at least, in my experience) is that auto-vectorization mechanisms are inhibited whenever an OpenMP worksharing construct is used on a loop. Anyhow, the two concepts are orthogonal and you can still benefit from both (see this other answer).
Do I need to explicitly tell it to?
I am afraid yes, at least at present. I would start rewriting the loops under consideration in a way that makes vectorization explicit (i.e. I will use intrinsics on Intel platform, Altivec on IBM and so on).

You are asking "why GCC can't do vectorization when OpenMP is enabled?".
It seems that this may be a bug of GCC :)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
Otherwise, an OpenMP API may introduce a dependency (either control or data) that prevents automatic vectorization. To auto-vectorize, a given piece of code must be free of data and control dependencies. It's possible that using OpenMP may cause some spurious dependency.
Note: OpenMP (prior to 4.0) is to use thread-level parallelism, which is orthogonal to SIMD/vectorization. A program can use both OpenMP and SIMD parallelism at the same time.

I ran across this post while searching for comments about the gcc 4.9 option openmp-simd, which should activate OpenMP 4 #pragma omp simd without activating omp parallel (threading). gcc bugzilla pr60117 (confirmed) shows a case where the pragma omp prevents auto-vectorization which occurred without the pragma.
gcc doesn't vectorize omp parallel for even with the simd clause (parallel regions can auto-vectorize only the inner loop nested under a parallel for). I don't know any compiler other than icc 14.0.2 which could be recommended for implementation of #pragma omp parallel for simd; with other compilers, SSE intrinsics coding would be required to get this effect.
The Microsoft compiler doesn't perform any auto-vectorization inside parallel regions in my tests, which show clear superiority of gcc for such cases.
Combined parallelization and vectorization of a single loop has several difficulties, even with the best implementation. I seldom see more than 2x or 3x speedup by adding vectorization to a parallel loop. Vectorization with AVX double data type, for example, effectively cuts the chunk size by a factor of 4. Typical implementation can achieve aligned data chunks only for the case where the entire array is aligned, and the chunks also are exact multiples of the vector width. When the chunks are not all aligned, there is inherent work imbalance due to the varying alignments.
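For reference, the OpenMP 4.0 combined construct discussed above can be written as in the sketch below. It assumes a compiler with OpenMP 4.0 support, and scaled_add is a hypothetical name. When OpenMP is not enabled the pragma is ignored and the loop simply runs serially.

```c
/* Hypothetical helper: a[i] = b[i] + c[i] * d for i in [0, n). */
void scaled_add(double *restrict a, const double *restrict b,
                const double *restrict c, double d, int n)
{
    /* parallel for: split the iterations across threads.
       simd: request that each thread's share of the loop be vectorised. */
    #pragma omp parallel for simd schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d;
}
```

Whether both the threading and the vectorisation actually happen depends on the compiler, as described above, so the vectoriser report is still the thing to check.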

Related

OpenMP - Initializing 1D Array

I am trying to initialize a dynamic array using OpenMP in C but it seems to be slower than the serial method. The function I am using is
int* createArray(int size, int num) {
int i;
int* M = (int*)malloc(size*sizeof(int));
srand(time(NULL));
double start = omp_get_wtime();
#pragma omp parallel for num_threads(num)
for (i = 0; i < size; i++) {
M[i] = rand() % (MAX_NUMBER - MIN_NUMBER + 1) + MIN_NUMBER;
}
double end = omp_get_wtime();
printf("Create Array %f\n",end-start);
return M;
}
I get an array of the given size containing random numbers but the fewer threads I use the faster the function is. Am I doing something wrong?
In general, a parallel application running slower than the corresponding sequential implementation usually comes from either the overhead of starting the threads or the bottleneck of having threads that are not perfectly independent (e.g. through shared data).
Here, the latter is true because you are calling rand(). This function relies on hidden global state, which has to be shared between threads. A way to overcome this would be to use a private seed for each thread. Furthermore, did you notice that your array is not really random when you have multiple threads? You could make the seed provided to srand() a function of omp_get_thread_num() to solve this.
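The private-seed idea can be sketched as follows. This is only an illustration: lcg_next and fill_random are hypothetical names, the generator is a plain linear congruential generator with the well-known Numerical Recipes constants (fine for test data, not for anything that needs good statistical quality), and the per-thread seed offset of 7919 is an arbitrary choice.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: one step of a 32-bit linear congruential generator.
   The state lives in caller-owned memory, so threads share nothing. */
static unsigned int lcg_next(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;   /* keep the better-behaved high bits */
}

/* Fills M[0..size) with values in [min_val, max_val]. */
void fill_random(int *M, int size, int min_val, int max_val, unsigned int seed)
{
    const unsigned int range = (unsigned int)(max_val - min_val + 1);

    #pragma omp parallel
    {
        /* Each thread owns its generator state and starts from its own seed. */
        unsigned int state = seed;
#ifdef _OPENMP
        state += 7919u * (unsigned int)omp_get_thread_num();
#endif
        #pragma omp for
        for (int i = 0; i < size; i++)
            M[i] = (int)(lcg_next(&state) % range) + min_val;
    }
}
```

A call such as fill_random(M, size, MIN_NUMBER, MAX_NUMBER, (unsigned int)time(NULL)) would stand in for the loop in createArray. The values produced depend on the number of threads, so runs with different thread counts are not reproducible against each other.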
I'm quite sure your program is suffering from a cache problem called "false sharing".
The article below explains it quite well.
https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
This often affects performance a lot.
You can test this quickly. Add the following clause to your omp pragma:
schedule(static, 16)
This should improve things a lot. Then you can dig further into false sharing.

Get Loop Index at Start of OpenMP Loop

I have a for loop, each iteration of which is almost completely independent of every other iteration. I did some initial experimenting with OpenMP to see if I could speed it up. Using one simple directive, I got a three to four fold speed increase. The code was something akin to this:
#pragma omp parallel for default(none) shared(ptr1,ptr2) firstprivate(const2,k,inc,max)
for(i = 0; i < max; i += inc)
{
float *ptr1_ = (*ptr1)[i>>k][0];
float v = ptr2[i/const2];
// do stuff with data
}
So then I went off and optimized the single threaded code. In the process, I discovered I could increment pointers instead of indexing them, and for whatever reason, this yielded a reasonable speed improvement. The problem now is, I can't figure out a simple way to tell OpenMP where the pointers start for each thread. Essentially, what I came up with was the following.
#pragma omp parallel default(none) shared(ptr1,ptr2) firstprivate(const1,inc,max)
{
int chunk = max / (omp_get_num_threads()*inc);
chunk = (chunk < 1)? 1: chunk;
float *ptr1_ = &(*ptr1)[0][0] + chunk*omp_get_thread_num()*const1;
float *ptr2_ = ptr2 + chunk*omp_get_thread_num();
#pragma omp for schedule(static,chunk)
for(i = 0; i < max; i += inc)
{
// do stuff with data
ptr1_ += const1;
ptr2_++;
}
}
This seems to work, although it took me some time to figure out how to compute the pointer offsets since inc is not one and I wasn't sure how this affects the meaning of chunk. Further, I'm not so confident it works correctly around the edge cases (when max is small or not an integral multiple of num_threads*inc). Also the code is much more complicated, and there are direct calls to omp functions, which I did not have to rely on before. And finally, by forcing the scheduling method and chunk size, I'm restricting the OpenMP implementation's potential optimizations.
So my question is, is there any way to get the loop index of the thread at the start of the loop so I don't have to manually compute the pointer offsets in this convoluted way?

openmp reduction does not provide the same answer as the sequential method

I am trying to parallelize a vector dot product program using OpenMP. The following code shows what I did.
#define N 1000000
float dotProduct = 0;
float vector1Host[N], vector2Host[N]; //each element in the vectors are initialized to a value between 1 and 2
#pragma omp parallel for private(i) reduction(+:dotProduct)
for (i = 0; i < N; i++)
dotProduct += vector1Host[i] * vector2Host[i];
The answer I get here is slightly different from what I get when I do the multiplication sequentially. Further, when I remove the reduction(+:dotProduct) and calculate the multiplications of each item separately and add them together later (sequentially) I get the same answer as the completely sequential method.
float productComponents[N];
#pragma omp parallel for private(i)
for (i = 0; i < N; i++)
productComponents[i] = vector1Host[i] * vector2Host[i];
for (i=0; i<N; i++)
dotProduct += productComponents[i];
The issue with this method is the performance. Please help me in finding the error in the first method, or an alternative method with good performance.
Update:
I added the output from a sample run.
N=1000000: Ans=2251335.750000: Time(ms)=2.59163 //sequential
N=1000000: Ans=2251356.750000: Time(ms)=0.65846 //openmp
Floating point addition is not associative: (a + b) + c can differ from a + (b + c), because every intermediate result is rounded. Therefore it is possible that your code is giving differing and possibly unpredictable results based on the order in which the floats are added to the accumulating variable.
OpenMP, due to the nature of parallelising the code, results in the additions being performed in an arbitrary order, and thus causes slightly unpredictable values due to the above non-associative behaviour of floats.
Either you need to accept this unpredictability or serialise the additions.
The other option would be to use a fixed point library which was able to guarantee associative addition, in which case the answer would be predictable regardless of the resulting order of the additions.
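The order dependence is easy to reproduce with three numbers. The sketch below uses hypothetical function names; sum_left and sum_right add the same values with different grouping. dot_double shows a further mitigation that is not suggested above and is an assumption of mine: accumulating in double. It does not make the result independent of the summation order, it only makes the rounding error much smaller relative to a float result.

```c
/* (x + y) + z, with the intermediate forced to float precision. */
float sum_left(float x, float y, float z)
{
    volatile float t = x + y;   /* volatile prevents keeping excess precision */
    return t + z;
}

/* x + (y + z): same operands, different grouping. */
float sum_right(float x, float y, float z)
{
    volatile float t = y + z;
    return x + t;
}

/* Dot product of two float vectors, accumulated in double. */
double dot_double(const float *u, const float *v, int n)
{
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (int i = 0; i < n; i++)
        acc += (double)u[i] * (double)v[i];
    return acc;
}
```

With x = 1e8f, y = -1e8f and z = 1.0f, sum_left returns 1 while sum_right returns 0, because -1e8f + 1.0f rounds back to -1e8f in single precision.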

Overhead and compiler optimization of dereference of nested struct elements

I'm wondering whether compilers (gcc with -O3 more specifically) can/will optimize out nested struct element dereferences (or not nested even).
For example, is there any point in doing the following code
register int i = 0;
register double multiple = struct1->struct2->element1;
for (i = 0; i < 10000; i++)
result[i] = multiple * -struct1->struct3->element3[i];
instead of
register int i = 0;
for (i = 0; i < 10000; i++)
result[i] = struct1->struct2->element1 * -struct1->struct3->element3[i];
I'm looking for the most optimized, but am not going to go through and bring outside of the loop struct dereferences if a compiler will optimize this out. If it does I think my best option is the following
register int i = 0;
register double* R = &result[0];
register double* amount = &struct1->struct3->element3[0];
for (i = 0; i < 10000; i++, R++, amount++)
*R = struct1->struct2->element1 * -*amount;
which eliminates all unnecessary dereferences etc. (I think). Would the 2 dereferences to get to element3 be optimized?
Any thoughts?
Thanks
This optimization is known as Loop-invariant code motion. Loop invariants (things that never change inside the loop) are moved outside of the loop, to avoid re-calculating the same thing over and over.
GCC supports it, and it is enabled by the -fmove-loop-invariants flag:
-fmove-loop-invariants
Enables the loop invariant motion pass in the new loop optimizer. Enabled at level -O1
Today, compilers are almost always smart enough to do the "right thing" no matter how you formulate your code. Focus on writing the simplest, cleanest, easiest to read (for a human!) code you can. Let the compiler take care of the rest by enabling optimizations. -O2 is commonly used.
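To make the comparison concrete, here is a sketch of the two forms side by side. The struct layout is an assumption, since the question does not show it, and scale_plain and scale_hoisted are hypothetical names.

```c
/* Assumed layout, inferred from the member names used in the question. */
struct inner2 { double element1; };
struct inner3 { double *element3; };
struct outer  { struct inner2 *struct2; struct inner3 *struct3; };

/* Straightforward version: the dereferences are written inside the loop. */
void scale_plain(double *result, const struct outer *struct1, int n)
{
    for (int i = 0; i < n; i++)
        result[i] = struct1->struct2->element1 * -struct1->struct3->element3[i];
}

/* Hand-hoisted version: what loop-invariant code motion aims to produce. */
void scale_hoisted(double *result, const struct outer *struct1, int n)
{
    const double multiple = struct1->struct2->element1;
    const double *src = struct1->struct3->element3;

    for (int i = 0; i < n; i++)
        result[i] = multiple * -src[i];
}
```

One caveat: the compiler may only hoist the loads in scale_plain if it can prove that writes through result never modify element1 or the element3 pointer. Since result is a double * and element1 is a double, that proof is not always available, in which case the hand-hoisted form (or restrict on result) can still make a difference. Comparing the generated assembly is the reliable way to find out.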

Does Intel array notation and elementary functions vectorize well with Xeon Phi ISA?

I am trying to find good material that clearly explains the different ways to write C/C++ source code that can be vectorized by the Intel compiler using array notation and elementary functions. All the materials online use trivial examples: saxpy, reduction etc. But there is a lack of explanation on how to vectorize code that has conditional branching or contains a loop with a loop-carried dependence.
For example: say there is sequential code I want to run with different arrays. A matrix is stored in row-major format. The columns of the matrix are computed by the compute_seq() function:
#define N 256
#define STRIDE 256
__attribute__((vector))
inline void compute_seq(float *sum, float* a) {
int i;
*sum = 0.0f;
for(i=0; i<N; i++)
*sum += a[i*STRIDE];
}
int main() {
// Initialize
float *A = malloc(N*N*sizeof(float));
float sums[N];
// The following line is not going to be valid, but I would like to do something like this:
compute_seq(sums[:],*(A[0:N:1]));
}
Any comments appreciated.
Here is a corrected version of the example.
__attribute__((vector(linear(sum),linear(a))))
inline void compute_seq(float *sum, float* a) {
int i;
*sum = 0.0f;
for(i=0; i<N; i++)
*sum += a[i*STRIDE];
}
int main() {
// Initialize
float *A = malloc(N*N*sizeof(float));
float sums[N];
compute_seq(&sums[:],&A[0:N:N]);
}
The important change is at the call site. The expression &sums[:] creates an array section consisting of &sums[0], &sums[1], &sums[2], ... &sums[N-1]. The expression &A[0:N:N] creates an array section consisting of &A[0*N], &A[1*N], &A[2*N], ...&A[(N-1)*N].
I added two linear clauses to the vector attribute to tell the compiler to generate a clone optimized for the case that the arguments are arithmetic sequences, as they are in this example. For this example, they (and the vector attribute) are redundant since the compiler can see both the callee and call site in the same translation unit and figure out the particulars for itself. But if compute_seq were defined in another translation unit, the attribute might help.
Array notation is a work in progress. icc 14.0 beta compiled my example for Intel(R) Xeon Phi(TM) without complaint. icc 13.0 update 3 reported that it couldn't vectorize the function ("dereference too complex"). Perversely, leaving the vector attribute off shut up the report, probably because the compiler can vectorize it after inlining.
I use the compiler option "-opt-assume-safe-padding" when compiling for Intel(R) Xeon Phi(TM). It may improve vector code quality. It lets the compiler assume that the page beyond any accessed address is safe to touch, thus enabling certain instruction sequences that would otherwise be disallowed.

Resources