Data race in parallelized nested loop - c

I have a triple nested loop that I would like to parallelize; however, I am getting a data race issue. I am pretty sure that I need to use a reduction somehow, but I don't quite know how.
This is the loop in question:
#pragma omp parallel for simd collapse(3)
for (uint64 u = 0; u < nu; ++u) {
    for (uint64 e = 0; e < ne; ++e) {
        for (uint64 v = 0; v < nv; ++v) {
            uAT[u][e] += _uT[u][e][v] * wA[e][v];
        }
    }
}
Could someone explain why this causes a data race? I would really like to understand this so that I don't run into these issues in the future. Also, can this loop be parallelized at all? If so, how?
EDIT: How do I know that there is a data race?
What this loop should accomplish (and does in serial) is to compute the element average of a function in a Discontinuous Galerkin framework. When I run the code a number of times, I sometimes get different results, even though it should always produce the same output. The wrong values are always smaller than they should be, which is why I assume that some values are not being added. For example, the computed average in the third cell is visibly too small.

Original answer concerning multi-threading, not SIMD
By using the collapse(3) clause, the whole iteration space of nu * ne * nv iterations is distributed among threads. This means that for any combination of u and e, the v loop could be distributed among multiple threads as well. These can then access the same element uAT[u][e] in parallel, which is the data race.
As long as nu * ne is much bigger than the number of CPU cores you work on, the easiest solution is to use collapse(2) instead. As there can be inefficient implementations of collapse (see here), you might even want to leave it out completely, provided nu is big enough.
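For illustration, the collapse(2) version of the loop from the question would look like this (keeping the question's uint64 type and array names):

#pragma omp parallel for collapse(2)
for (uint64 u = 0; u < nu; ++u) {
    for (uint64 e = 0; e < ne; ++e) {
        // Each (u, e) pair is handled by exactly one thread, so the
        // accumulation into uAT[u][e] is race-free.
        for (uint64 v = 0; v < nv; ++v) {
            uAT[u][e] += _uT[u][e][v] * wA[e][v];
        }
    }
}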
If you really need the parallelism from all three loops to efficiently use your hardware, you can either use reduction(+: uAT[0:nu][0:ne]) (add it to your existing pragma) or put a #pragma omp atomic update in front of uAT[u][e] += ...;. Which of these options is faster should be benchmarked. The reduction clause will use a lot more memory, as every thread gets its own private copy of the whole memory addressed through uAT. On the other hand, the atomic update could in the worst case serialize part of your parallel work and give worse performance than using collapse(2).
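Hedged sketches of those two options (assuming uAT is a contiguous 2D array, so the array-section reduction is valid; the simd clause is left off for clarity):

// Option 1: array reduction - each thread accumulates into a private copy of uAT
#pragma omp parallel for collapse(3) reduction(+: uAT[0:nu][0:ne])
for (uint64 u = 0; u < nu; ++u)
    for (uint64 e = 0; e < ne; ++e)
        for (uint64 v = 0; v < nv; ++v)
            uAT[u][e] += _uT[u][e][v] * wA[e][v];

// Option 2: atomic update - no private copies, but possible contention on uAT[u][e]
#pragma omp parallel for collapse(3)
for (uint64 u = 0; u < nu; ++u)
    for (uint64 e = 0; e < ne; ++e)
        for (uint64 v = 0; v < nv; ++v) {
            #pragma omp atomic update
            uAT[u][e] += _uT[u][e][v] * wA[e][v];
        }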
Edit 1: SIMD
I just saw that you are also concerned with SIMD, not just multi-threading. My original answer is about the latter. For SIMD I would first take a look at your compiler's output without any OpenMP directives (e.g. on Compiler Explorer). If your compiler (with optimization enabled and information on the target processor architecture) already uses SIMD instructions, you might not need OpenMP in the first place.
If you still want to use OpenMP for vectorizing your loops, leaving out the collapse clause and putting the pragma in front of the second loop would be my first approach. Leaving it in front of the first loop with collapse(2) might also work. Even a reduction should work, but seems unnecessarily complex in this context, as I would expect there to be enough parallelism in nu or nu * ne to fill your SIMD lanes. I have never used an array reduction like the one described above in the SIMD context, so I'm not quite sure what it would do (allocating an array for each SIMD lane doesn't seem realistic), or whether it is even part of the OpenMP standard (it depends on the version of the standard, see here).
The way your code is written right now, the description of the data race for multi-threading technically still applies, I think. I'm not sure, though, whether your code leads to (efficient) vectorization at all, in which case the compiled binary might not have the data race.
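For concreteness, the two pragma placements suggested above could look like the following; this is just one reading of that suggestion, and whether either version actually vectorizes well should be checked in the compiler output:

// Variant A: pragma on the second loop, no collapse
for (uint64 u = 0; u < nu; ++u) {
    #pragma omp parallel for simd
    for (uint64 e = 0; e < ne; ++e) {
        for (uint64 v = 0; v < nv; ++v) {
            uAT[u][e] += _uT[u][e][v] * wA[e][v];
        }
    }
}

// Variant B: pragma on the first loop with collapse(2)
#pragma omp parallel for simd collapse(2)
for (uint64 u = 0; u < nu; ++u) {
    for (uint64 e = 0; e < ne; ++e) {
        for (uint64 v = 0; v < nv; ++v) {
            uAT[u][e] += _uT[u][e][v] * wA[e][v];
        }
    }
}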
Edit 2: Threading + SIMD
I benchmarked a few versions of this loop nest for nu = 4, nv = 128 and ne between 1024 and 524288 (2^19). My benchmarking was done using google-benchmark (i.e. C++ instead of C, which shouldn't matter here, I would think) with gcc 11.3 (always using -march=native -mtune=native, so for portable performance these numbers might not be representative). I made sure to initialize all data in parallel (first-touch policy) to avoid bad NUMA effects. I slightly modified the problem/code to use contiguous memory and do the multi-dimensional indexing manually. As OP's code didn't show the data type, I used float.
The three best versions I will share here all perform relatively similarly. So depending on the hardware architecture there might be differences in which one is best. For me, the version inspired by Peter Cordes' comments below this answer performed best:
#pragma omp parallel for
for (uint64_t e = 0UL; e < ne; ++e) {
    // In the future we will get `#pragma omp unroll partial(4)`.
    // The unrolling might actually not be necessary (or even a pessimization).
    // So maybe leave it to the compiler.
    #pragma GCC unroll 4
    for (uint64_t u = 0UL; u < nu; ++u) {
        float temp = 0.0f;
        #pragma omp simd reduction(+ : temp)
        for (uint64_t v = 0UL; v < nv; ++v) {
            temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
        }
        uAT[u * ne + e] += temp;
    }
}
One could also use a float temp[nu]; and put the unrolled u loop inside the v loop to get even nearer to Peter Cordes' description, but then one would have to use an array reduction as described above. These array reductions consistently caused stack overflows for me, so I settled on this version, which depends on nv being small enough that wA can still be cached between u iterations.
This second version just differs in the u loop staying on the outside:
#pragma omp parallel
for (uint64_t u = 0UL; u < nu; ++u) {
    #pragma omp for
    for (uint64_t e = 0UL; e < ne; ++e) {
        float temp = 0.0f;
        #pragma omp simd reduction(+ : temp)
        for (uint64_t v = 0UL; v < nv; ++v) {
            temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
        }
        uAT[u * ne + e] += temp;
    }
}
The third version uses collapse(2):
#pragma omp parallel for collapse(2)
for (uint64_t u = 0UL; u < nu; ++u) {
    for (uint64_t e = 0UL; e < ne; ++e) {
        float temp = 0.0f;
        #pragma omp simd reduction(+ : temp)
        for (uint64_t v = 0UL; v < nv; ++v) {
            temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
        }
        uAT[u * ne + e] += temp;
    }
}
TL;DR:
I think the most important points can be seen in all three versions:
#pragma omp parallel for simd is nice for big simple loops. If you have a loop nest, you probably should split the pragma up.
Use simd on a loop which accesses contiguous elements in contiguous iterations.
By respecting the first two points, you don't need the possibly expensive array reduction.
By using a temporary (i.e. register) for the reduction instead of writing back to memory, you make using OpenMP easier and probably have better performance in serial code as well. Due to floating point non-associativity, this optimization can't be done by most compilers without allowing it via e.g. -ffast-math (gcc).
Most compilers cannot vectorize the reduction on their own for the same reason.
Details like using collapse or the order of the e and u loops are smaller optimizations which are not as important, as long as you provide enough parallelism. I.e. don't parallelize over just u if nu is small (as the question author wrote below this answer).

Related

How to ensure data synchronization with OpenMP?

When I evaluate the math expression in the following code, the matrix values are not consistent. How can I fix that?
#pragma omp parallel num_threads(NUM_THREADS)
{
    #pragma omp for
    for (int i = 1; i < qtdPassos; i++)
    {
        #pragma omp critical
        matriz[i][0] = matriz[i-1][0];
        for (int j = 1; j < qtdElementos-1; j++)
        {
            matriz[i][j] = (matriz[i-1][j-1] + (2 * matriz[i-1][j]) + matriz[i-1][j+1]) / 4; // Xi(t+1) = [Xi-1(t) + 2*Xi(t) + Xi+1(t)] / 4
        }
        matriz[i][qtdElementos-1] = matriz[i-1][qtdElementos-1];
    }
}
The problem comes from a race condition caused by a loop-carried dependency. The enclosing loop cannot be parallelised (nor can the inner one), since each iteration reads the previous row of matriz while writing the current one. The same applies to the boundary columns.
Note that OpenMP does not check whether a loop can be parallelized (in fact, this cannot be decided in general); it is your responsibility to check that. Additionally, using a critical section around the whole iteration serializes the execution, defeating the purpose of a parallel loop (in fact, it will be slower due to the overhead of the critical section). Note also that #pragma omp critical only applies to the next statement, so protecting the line matriz[i][0] = matriz[i-1][0]; is not enough to avoid the race condition.
I do not think this code can be (efficiently) parallelised as written. That being said, if your goal is to implement a 1D/2D stencil, you can use a double-buffering technique (i.e. write to an array that is different from the input array). The same logic applies to a 1D stencil repeated multiple times (which is apparently what you want to do); note that the results will be different in that case. For the 1D stencil case, double buffering removes the dependency and lets you parallelize the inner loop. For the 2D stencil case, the two nested loops can be parallelized.
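A minimal double-buffering sketch for the repeated 1D stencil (a hypothetical helper; cur and next are two row buffers that are swapped after every time step, and only the final row is kept):

#include <stdlib.h>

void stencil_steps(const double *row0, int qtdElementos, int qtdPassos, double *result)
{
    double *cur  = malloc(qtdElementos * sizeof(double));
    double *next = malloc(qtdElementos * sizeof(double));
    for (int j = 0; j < qtdElementos; j++) cur[j] = row0[j];

    for (int t = 1; t < qtdPassos; t++) {
        next[0] = cur[0];
        next[qtdElementos - 1] = cur[qtdElementos - 1];
        // Reads only from cur and writes only to next: no loop-carried
        // dependency, so this loop can safely be parallelized.
        #pragma omp parallel for
        for (int j = 1; j < qtdElementos - 1; j++)
            next[j] = (cur[j-1] + 2.0 * cur[j] + cur[j+1]) / 4.0;
        double *tmp = cur; cur = next; next = tmp; // swap buffers
    }

    for (int j = 0; j < qtdElementos; j++) result[j] = cur[j];
    free(cur);
    free(next);
}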

OpenMP parallelize grouped array sum using pointers

I want to effectively parallelize the following sum in C:
#pragma omp parallel for num_threads(nth)
for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];
where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups which occur in a random order, and pout is a pointer to a double array out of size ng which is initialized with zeros and contains the result of summing x over the grouping defined by g.
The code above works, but the performance is not optimal, so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays and the number of threads nth are available to me and fixed beforehand. I cannot directly access the arrays; only the pointers are passed to a function which does the parallel sum.
Your code has a data race (at the line pout[pg[i]] += ...); you should fix that first, then worry about performance.
If ng is not too big and you use OpenMP 4.5+, the most efficient solution is using a reduction: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
If ng is too big, most probably the best idea is to use a serial version of the program on PCs. Note that your code becomes correct by adding #pragma omp atomic before pout[pg[i]] += ..., but its performance is questionable.
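Spelled out, the reduction variant of the loop could look like this (requires OpenMP 4.5+ for reductions over array sections; every thread gets a private, zero-initialized copy of pout[0..ng-1] that is summed into the shared array at the end):

#pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
for (int i = 0; i < l; ++i)
    pout[pg[i]] += px[i];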
From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control with critical sections or locks will probably only slow down the code.
Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.
Here is the crucial bit of code:
#pragma omp parallel shared(sum, threadsum)
{
    int thread = omp_get_thread_num(),
        myfirst = thread * ngroups;
    #pragma omp for
    for (int i = 0; i < biglen; i++)
        threadsum[myfirst + indexes[i]] += 1;
    #pragma omp for
    for (int igrp = 0; igrp < ngroups; igrp++)
        for (int t = 0; t < nthreads; t++)
            sum[igrp] += threadsum[t * ngroups + igrp];
}
Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.
Final word: I also coded @Laci's solution of just using a reduction. Testing with 1M groups in the output: for 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2, because the reduction solution repeatedly adds the whole array while I sum it just once, and then in parallel. For smaller numbers of groups the reduction is probably preferable overall.

Matrix Multiplication using OpenMP (C) - Collapsing all the loops

So I was learning about the basics of OpenMP in C and its work-sharing constructs, particularly the for loop. One of the most famous examples used in all the tutorials is matrix multiplication, but all of them just parallelize the outer loop or the two outer loops. I was wondering why we do not parallelize and collapse all 3 loops (using atomic), as I have done here:
for (int i = 0; i < 100; i++) {
    // Initialize the arrays
    for (int j = 0; j < 100; j++) {
        A[i][j] = i;
        B[i][j] = j;
        C[i][j] = 0;
    }
}

// Starting the matrix multiplication
#pragma omp parallel num_threads(4)
{
    #pragma omp for collapse(3)
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            for (int k = 0; k < 100; k++) {
                #pragma omp atomic
                C[i][j] = C[i][j] + (A[i][k] * B[k][j]);
            }
        }
    }
}
Can you tell me what I am missing here, or whether this is an inferior/superior solution and why?
Atomic operations are very costly on most architectures compared to non-atomic ones (see here to understand why, or here for a more detailed analysis). This is especially true when many threads make concurrent accesses to the same shared memory area. To put it simply, one cause is that threads performing atomic operations cannot fully run in parallel without waiting for the others on most hardware, due to the implicit synchronizations and communications coming from the cache coherence protocol. Another source of slowdown is the high latency of atomic operations (again due to the cache hierarchy).
If you want to write code that scale well, you need to minimize synchronizations and communications (including atomic operations).
As a result, using collapse(2) is much better than collapse(3). But this is not the only issue in your code. Indeed, to be efficient you must perform memory accesses contiguously and keep data in caches as much as possible.
For example, swapping the loop iterating over j and the one iterating over k (which does not work with collapse(2)) is several times faster on most systems due to more contiguous memory accesses (about 8 times on my PC):
for (int i = 0; i < 100; i++) {
    // Initialize the arrays
    for (int j = 0; j < 100; j++) {
        A[i][j] = i;
        B[i][j] = j;
        C[i][j] = 0;
    }
}

// Starting the matrix multiplication
#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for (int i = 0; i < 100; i++) {
        for (int k = 0; k < 100; k++) {
            for (int j = 0; j < 100; j++) {
                C[i][j] = C[i][j] + (A[i][k] * B[k][j]);
            }
        }
    }
}
Writing fast matrix-multiplication code is not easy. Consider using BLAS libraries such as OpenBLAS, ATLAS, Eigen or Intel MKL rather than writing your own code if your goal is to use this in production code. Indeed, such libraries are very optimized and often scale well on many cores.
If your goal is to understand how to write efficient matrix-multiplication codes, a good starting point may be to read this tutorial.
Collapsing loops requires that you know what you are doing as it may result in very cache-unfriendly splits of the iteration space or introduce data dependencies depending on how the product of the loop counts relates to the number of threads.
Imagine the following constructed example, which is not that uncommon actually (the loop counts are small just to illustrate the point):
for (int i = 0; i < 7; i++)
for (int j = 0; j < 3; j++)
a[i] += b[i][j];
If you parallelise the outer loop, three threads get two iterations and one thread gets just one, but all of them do all the iterations of the inner loop:
---0-- ---1-- ---2-- -3- (thread number)
000111 222333 444555 666 (values of i)
012012 012012 012012 012 (values of j)
Each a[i] gets processed by one thread only. Smart compilers may implement the inner loop using register optimisation, accumulating the values in a register first and only assigning to a[i] at the very end, and it will run very fast.
If you collapse the two loops, you end up in a very different situation. Since there is a total of 7x3 = 21 iterations now, the default split will be (depending on the compiler and the OpenMP runtime, but most of them do this) five iterations per thread and one gets six iterations:
--0-- --1-- --2-- ---3-- (thread number)
00011 12223 33444 555666 (values of i)
01201 20120 12012 012012 (values of j)
As you can see, now a[1] is processed by both thread 0 and thread 1. Similarly, a[3] is processed by both thread 1 and thread 2. And there you have it - you just introduced a data dependency that wasn't there in the previous case, so now you have to use atomic in order to prevent data races. That price that you pay for synchronisation is way higher than doing one iteration more or less! In your case, if you only collapse the two outer loops, you won't need to use atomic at all (although, in your particular case, 4 divides 100 and even when collapsing all the loops together you don't need the atomic construct, but you need it in the general case).
Another issue is that after collapsing the loops, there is a single loop index and both i and j indices have to be reconstructed from this new index using division and modulo operations. For simple loop bodies like yours, the overhead of reconstructing the indices may be simply too high.
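For reference, a sketch of the collapse(2) variant discussed above: because every (i, j) pair, and hence every C[i][j], is owned by a single thread, no atomic is needed at all.

#pragma omp parallel num_threads(4)
{
    #pragma omp for collapse(2)
    for (int i = 0; i < 100; i++) {
        for (int j = 0; j < 100; j++) {
            for (int k = 0; k < 100; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}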
There are very few good reasons not to use a library for matrix-matrix multiplication, so as suggested already, please call BLAS instead of writing this yourself. Nonetheless, the questions you ask are not specific to matrix-matrix multiplication, so they deserve to be answered anyways.
There are a few things that can be improved here:
Use contiguous memory.
If K is the innermost loop, you are doing dot-products, which are harder to vectorize. The loop order IKJ will vectorize better, for example.
If you want to parallelize a dot product with OpenMP, use a reduction instead of many atomics.
I have illustrated each of these techniques independently below.
Contiguous memory
int n = 100;
double *C = malloc(n * n * sizeof(double));
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        C[i*n + j] = 0.0;
    }
}
IKJ loop ordering
for (int i = 0; i < 100; i++) {
    for (int k = 0; k < 100; k++) {
        for (int j = 0; j < 100; j++) {
            C[i][j] = C[i][j] + (A[i][k] * B[k][j]);
        }
    }
}
Parallel dot-product
double x = 0;
#pragma omp parallel for reduction(+:x)
for (int k = 0; k < 100; k++) {
    x += (A[i][k] * B[k][j]);
}
C[i][j] += x;
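For context, that fragment sits inside the loops over i and j; a complete sketch would be as follows (shown only to illustrate the reduction mechanics; parallelizing a 100-element dot product by itself is rarely worthwhile):

for (int i = 0; i < 100; i++) {
    for (int j = 0; j < 100; j++) {
        double x = 0.0;
        // The reduction gives each thread a private partial sum,
        // avoiding atomics on the shared accumulator.
        #pragma omp parallel for reduction(+:x)
        for (int k = 0; k < 100; k++) {
            x += A[i][k] * B[k][j];
        }
        C[i][j] += x;
    }
}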
External resources
How to Write Fast Numerical Code: A Small Introduction covers these topics in far more detail.
BLISlab is an excellent tutorial specific to matrix-matrix multiplication that will teach you how the experts write a BLAS library call.

How to use AVX/SIMD with nested loops and += format?

I am writing a page rank program, specifically a method for updating the rankings. I have successfully got it working with nested for loops and also in a threaded version. However, I would like to use SIMD/AVX instead.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}
For this code P[] is of size npages and matrix_cap[] is of size npages * npages. P[] is the ranks of the pages and temp[] is used to store the next iterations page ranks so as to be able to check convergence.
I don't know how to interpret += with AVX and how I would get my data which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row major order) into a format of which could be used with SIMD/AVX operations.
As far as AVX this is what I have so far though it's very very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
    for (size_t j = 0; j < mod; j += 4) {
        __m256d p = _mm256_loadu_pd(P + j);
        __m256d m = _mm256_loadu_pd(matrix_hat + i + j);
        __m256d pm = _mm256_mul_pd(p, m);
        _mm256_storeu_pd(&res + j, pm);
        for (size_t k = 0; k < 4; k++) {
            sum += res[j + k];
        }
    }
    for (size_t i = mod; i < npages; i++) {
        for (size_t j = 0; j < npages; j++) {
            sum += P[j] * matrix_cap[IDX(i,j)];
        }
    }
    temp[i] = sum;
    sum = 0.0;
}
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
Consider using the OpenMP 4.x #pragma omp simd reduction for the innermost loop. Bear in mind that omp reductions are not applicable to C++ arrays; therefore, you have to use a temporary reduction variable as shown below.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0; // was: temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP 4.x is currently supported by fresh GCC (4.9+) and Intel compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" means vectorization by the compiler without any pragmas, i.e. without explicit guidance from developers) may sometimes work for some compiler variants (although it's very unlikely here, due to the array element being used as the reduction variable). However, it is strictly speaking incorrect to auto-vectorize this code. You have to use an explicit SIMD pragma to "resolve" the reduction dependency and (as a good side effect) disambiguate pointers (in case the arrays are accessed via pointers).
First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < (npages & (~7ULL)); i += 8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j += 4) {
        __m256d Pj = _mm256_loadu_pd(P + j);   // reused 8 times after loading
        // temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // Or do this block with a hsum + transpose and do vector stores,
    // taking advantage of the power of vhaddpd to do 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0); // see the horizontal-sum link earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    // ...
    temp[i+7] = hsum_pd256(s7);
    // If npages isn't a multiple of 4, add the last couple of scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
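The code above relies on a hsum_pd256() helper that the answer only links to; for completeness, a common way to write it (my sketch, not part of the original answer) is:

#include <immintrin.h>

static inline double hsum_pd256(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);       // lower two doubles
    __m128d hi = _mm256_extractf128_pd(v, 1);     // upper two doubles
    lo = _mm_add_pd(lo, hi);                      // add upper half onto lower half
    __m128d high64 = _mm_unpackhi_pd(lo, lo);     // move element 1 down to element 0
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64)); // final scalar sum
}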
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
I missed this part of the question earlier. First of all, using float will obviously give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 less memory / cache footprint might give even more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (they still only take a load-port uop), and plenty cheap for this on SnB/IvB, where they also take a shuffle-port uop.
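A hedged sketch of that striped layout for double (the matrix_striped array and its exact layout are my assumption for illustration): strip s holds rows 4*s .. 4*s+3, stored as npages consecutive groups of 4 doubles, one group per j, so a single load yields 4 i values for the same j and no horizontal sums are needed.

// Assumed layout: within strip s, group j holds matrix[4*s + 0..3][j] contiguously.
for (size_t s = 0; s < npages / 4; ++s) {
    const double *strip = matrix_striped + s * npages * 4;
    __m256d acc = _mm256_setzero_pd();                // one i value per lane
    for (size_t j = 0; j < npages; ++j) {
        __m256d Pj  = _mm256_broadcast_sd(&P[j]);     // P[j] in all 4 lanes
        __m256d col = _mm256_loadu_pd(strip + j * 4); // 4 i values, same j
        acc = _mm256_fmadd_pd(Pj, col, acc);
    }
    _mm256_storeu_pd(&temp[4 * s], acc);              // no horizontal sum needed
}

In practice you would still want several independent accumulators (or several strips per outer iteration) to hide FMA latency, as described earlier.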

Ordered Parallel code runs slower than single threading. Is there a solution?

#pragma omp parallel for default(none) shared(x) private(y, z, f) ordered
for (i = 0; i < 512; i++) {
    #pragma omp ordered
    for (y = 0; y < 512; y++) {
        for (z = 0, f = 0; z < 512; z++) {
            x[f++] = z + i + y;
        }
    }
}
The above code runs about 20% slower than non-SMP execution on a dual core. Without the #pragma omp ordered it is about 50% faster than non-SMP.
The x[f++] sequence is assumed to have to remain in ordered form, since it is reused later in a similar way.
Can ordered code be faster than single threading? Is there another method to achieve it?
System is win32/mingw-w64.
It's not really ordered, since the results of one iteration do not depend upon the previous, except for your use of f.
Can you derive f from i,y and z? It looks like you can. For example:
f = z + y * 512 + i * 512 * 512 + initial_f;
Now your code is unordered, and you can get real benefits from parallelization.
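A sketch of the resulting unordered loop (assuming initial_f is 0 and x is large enough to hold 512*512*512 elements):

#pragma omp parallel for default(none) shared(x)
for (int i = 0; i < 512; i++) {
    for (int y = 0; y < 512; y++) {
        for (int z = 0; z < 512; z++) {
            // The index is derived from the loop variables, so iterations
            // no longer depend on a shared running counter.
            size_t f = (size_t)z + (size_t)y * 512 + (size_t)i * 512 * 512;
            x[f] = z + i + y;
        }
    }
}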
Single-threaded/-core code is often faster than multi-threaded/-core code due to saturation of the memory system. What happens is that the memory work required by the single thread is close to or at the limit of what the memory system can deliver. Add another thread/core that requires the same work, and both threads/cores will need to share what the memory system can provide, resulting in wait states and slower execution.
After profiling and optimization of the memory work you may reach the point where the multi-threaded code is faster. The optimization requires moving data into non-shared memory (i.e. the L1 & L2 caches) and minimizing accesses to shared memory (L3 & RAM).
The optimization solution is more or less unique to the application at hand. It is not trivial (though some third-party SW vendors will try to say that with their product it's a piece of cake). Once you've done it you'll at least have learned what constructs should be avoided and what techniques are useful.
You are obviously relying on a shared vector x in the inner loop. So each access to that variable must be mutexed by OMP. No wonder that the "parallel" version is slower than the sequential one.
It is difficult to advise you what to change, since your code makes no sense to me at all. What do you expect the result to be? If you use ordered, the final result in x will be the version for i set to 511. If you don't, it is whichever thread wins for each individual entry.
And what the h... is your f supposed to do? When evaluated it has the same value as w, no? This is just adding noise to make it harder to understand.
