Ordered Parallel code runs slower than single threading. Is there a solution? - c

#pragma omp parallel for default(none) shared(x) private (y, z, f) ordered
for (i = 0; i < 512; i++) {
#pragma omp ordered
for (y = 0; y < 512; y++) {
for (z = 0, f = 0; z < 512; z++) {
x[f++] = z + i + y;
}
}
}
The above code runs about 20% slower than non-SMP execution on a dual core. Without the "#pragma omp ordered" it is about 50% faster than non-SMP.
I assume the x[f++] sequence has to stay in order, since it is reused in the same way later.
Can ordered code be faster than single-threaded code? Is there another way to achieve this?
System is win32/mingw-w64.

It's not really ordered, since the results of one iteration do not depend upon the previous, except for your use of f.
Can you derive f from i,y and z? It looks like you can. For example:
f = z + y * 512 + i * 512 * 512 + initial_f;
Now your code is unordered, and you can get real benefits from parallelization.
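A minimal sketch of that idea, assuming x has one slot per (i, y, z) triple and initial_f is 0. The function name fill_unordered and the size parameter n are illustrative and not from the question. Note that the question's loop resets f to 0 for every y, so there x only ever receives 512 entries; the sketch follows the answer's assumption instead. Because each iteration computes its own index, no ordered clause is needed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fills x (n*n*n elements) so that the slot for
   (i, y, z) holds i + y + z. The index is derived from the loop
   counters, so iterations are independent and may run in any order. */
void fill_unordered(int *x, int n)
{
    assert(x != NULL && n > 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int y = 0; y < n; y++) {
            for (int z = 0; z < n; z++) {
                size_t f = (size_t)z + (size_t)y * n + (size_t)i * n * n;
                x[f] = z + i + y;
            }
        }
    }
}
```

The pragma is ignored when the file is compiled without OpenMP support, in which case the loop simply runs serially with the same result.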

Single-threaded/-core code is often faster than multi-threaded/-core due to saturation of the memory system. What happens is that the memory work required by the single thread is close to or at the limit of what the memory system can deliver. Add another thread/core that requires the same work and both threads/cores will need to share what the memory system can provide, resulting in wait states and slower execution.
After profiling and optimization of the memory work you may reach the point where the multi-threaded code is faster. The optimization requires moving data into non-shared memory (i.e. L1 & L2 caches) and minimizing accesses to shared memory (L3 & RAM).
The optimization solution is more or less unique to the application at hand. It is not trivial (though some third-party SW vendors will try to say that with their product it's a piece of cake). Once you've done it you'll at least have learned what constructs should be avoided and what techniques are useful.

You are obviously relying on a shared vector x in the inner loop. So each access to that variable must be mutexed by OMP. No wonder that the "parallel" version is slower than the sequential one.
It is difficult to advise you what to change, since your code makes no sense to me at all. What do you expect the result to be? If you keep ordered, the final result in x will be the version for i equal to 511. If you don't, each entry holds the value from whichever thread wrote it last.
And what the h... is your f supposed to do? When evaluated it has the same value as z, no? This is just adding noise to make it harder to understand.

Related

Data race in parallelized nested loop

I have a triple nested loop that I would like to parallelize, however, I am getting a data race issue. I am pretty sure that I need to use a reduction somehow, but I don't quite know how.
This is the loop in question:
#pragma omp parallel for simd collapse(3)
for (uint64 u = 0; u < nu; ++u) {
for (uint64 e = 0; e < ne; ++e) {
for (uint64 v = 0; v < nv; ++v) {
uAT[u][e] += _uT[u][e][v] * wA[e][v];
}
}
}
Could someone explain to me, why this causes a data race? I would really like to understand this, so that I don't run into these issues in the future. Also, can this loop be parallelized at all? If so, how?
EDIT: How do I know that there is a data race?
What this loop should accomplish (and it does in serial) is to compute the element average of a function in a Discontinuous Galerkin framework. When I run the code a bunch of times, I sometimes get different results, even though it should always produce the same results. The resulting wrong values are always smaller than what they should be, which is why I assume that some values are not being added. Maybe this picture explains it better: The average in the third cell is obviously wrong (too small).
Original answer concerning multi-threading, not SIMD
By using the collapse(3) clause, the whole iteration space of nu * ne * nv iterations is distributed among threads. This means that for any combination of u and e, the v loop could be distributed among multiple threads as well. These can then access the same element uAT[u][e] in parallel, which is the data race.
As long as nu * ne is much bigger than the number of CPU cores that you work on, the easiest solution is to use collapse(2) instead. As there can be inefficient implementations of collapse (See here), you might even want to omit it completely, provided nu is big enough.
If you really need the parallelism from all three loops to efficiently use your hardware, you can use either reduction(+: uAT[0:nu][0:ne]) (add it into your existing pragma) or put a #pragma omp atomic update in front of uAT[u][e] += ...;. Which of these options is faster should be benchmarked. The reduction clause will use a lot more memory due to every thread getting its own private copy of the whole memory addressed through uAT. On the other hand the atomic update could in the worst case sequentialize part of your parallel work and give worse performance than using collapse(2).
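A sketch of the atomic variant, assuming flat, contiguous float arrays with manual indexing (the question shows neither the data type nor the memory layout). The name element_average_atomic and its parameters are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: uT holds nu*ne*nv values, wA holds ne*nv values,
   uAT holds nu*ne values. */
void element_average_atomic(float *uAT, const float *uT, const float *wA,
                            size_t nu, size_t ne, size_t nv)
{
    assert(uAT != NULL && uT != NULL && wA != NULL);
    #pragma omp parallel for collapse(3)
    for (size_t u = 0; u < nu; ++u) {
        for (size_t e = 0; e < ne; ++e) {
            for (size_t v = 0; v < nv; ++v) {
                float prod = uT[(u * ne + e) * nv + v] * wA[e * nv + v];
                /* Several threads may target the same uAT element, so
                   the read-modify-write has to be atomic. */
                #pragma omp atomic update
                uAT[u * ne + e] += prod;
            }
        }
    }
}
```

With floating-point data the order of the additions is not fixed, so results may differ in the last bits between runs; whether this beats collapse(2) or a reduction has to be measured.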
Edit 1: SIMD
I just saw that you are also concerned with SIMD instead of just multi-threading. My original answer is about the latter. For SIMD I would first take a look at your compilers output without any OpenMP directives (e.g. on Compiler Explorer). If your compiler (with optimization and information on the target processor architecture) does already use SIMD instructions, you might not need to use OpenMP in the first place.
If you still want to use OpenMP for vectorizing your loops, omitting the collapse clause and putting the pragma in front of the second loop would be my first ansatz. Leaving it in front of the first loop with collapse(2) might also work. Even a reduction should work but seems unnecessarily complex in this context, as I would expect there to be enough parallelism in nu or nu * ne to fill your SIMD lanes. I have never used an array reduction as described above in the SIMD context, so I'm not quite sure what it would do (i.e. allocating an array for each SIMD lane doesn't seem realistic), or if it even is part of the OpenMP standard (depends on the version of the standard, see here).
The way your code is written right now, the description of the data race for multi-threading technically still applies, I think. I'm not sure though if your code causes (efficient) vectorization at all, so the compiled binary might not have the data race (it might not be vectorized at all).
Edit 2: Threading + SIMD
I benchmarked a few versions of this loop nest for nu = 4, nv = 128 and ne between 1024 and 524288 (2^19). My benchmarking was done using google-benchmark (i.e. C++ instead of C, shouldn't matter here, I would think) with gcc 11.3 (always using -march=native -mtune=native, i.e. for portable performance this might not be helpful). I made sure to initialize all data in parallel (first touch policy) to avoid bad NUMA effects. I slightly modified the problem/code to use contiguous memory and do the multi-dimensional indexing manually. As OP's code didn't show the data type, I used float.
The three best versions, which I will share here, all perform relatively similarly, so depending on the hardware architecture there might be differences in which one is the best. For me the version inspired by Peter Cordes' comments below this answer performed best:
#pragma omp parallel for
for (uint64_t e = 0UL; e < ne; ++e) {
// In the future we will get `#pragma omp unroll partial(4)`.
// The unrolling might actually not be necessary (or even a pessimization).
// So maybe leave it to the compiler.
#pragma GCC unroll 4
for (uint64_t u = 0UL; u < nu; ++u) {
float temp = 0.0f;
#pragma omp simd reduction(+ : temp)
for (uint64_t v = 0UL; v < nv; ++v) {
temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
}
uAT[u * ne + e] += temp;
}
}
One could also have a float temp[nu]; and put the unrolled u loop inside the v loop to get even nearer to Peter Cordes' description, but then one would have to use an array reduction as described above. These array reductions consistently caused stack overflows for me, so I settled on this version, which depends on nv being small enough that wA can still be cached between u iterations.
This second version just differs in the u loop staying on the outside:
#pragma omp parallel
for (uint64_t u = 0UL; u < nu; ++u) {
#pragma omp for
for (uint64_t e = 0UL; e < ne; ++e) {
float temp = 0.0f;
#pragma omp simd reduction(+ : temp)
for (uint64_t v = 0UL; v < nv; ++v) {
temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
}
uAT[u * ne + e] += temp;
}
}
The third version uses collapse(2):
#pragma omp parallel for collapse(2)
for (uint64_t u = 0UL; u < nu; ++u) {
for (uint64_t e = 0UL; e < ne; ++e) {
float temp = 0.0f;
#pragma omp simd reduction(+ : temp)
for (uint64_t v = 0UL; v < nv; ++v) {
temp += uT[(u * ne + e) * nv + v] * wA[e * nv + v];
}
uAT[u * ne + e] += temp;
}
}
TL;DR:
I think the most important points can be seen in all three versions:
#pragma omp parallel for simd is nice for big simple loops. If you have a loop nest, you probably should split the pragma up.
Use simd on a loop which accesses contiguous elements in contiguous iterations.
By respecting the first two points, you don't need the possibly expensive array reduction.
By using a temporary (i.e. register) for the reduction instead of writing back to memory, you make using OpenMP easier and probably have better performance in serial code as well. Due to floating point non-associativity, this optimization can't be done by most compilers without allowing it via e.g. -ffast-math (gcc).
Most compilers can not vectorize the reduction on their own for the same reason.
Details like using collapse or the order of the e and the u loop are smaller optimizations which are not as important as long as you provide enough parallelism. I.e. don't parallelize over just u if nu is small (as the question author wrote below this answer).

OpenMP parallelize grouped array sum using pointers

I want to effectively parallelize the following sum in C:
#pragma omp parallel for num_threads(nth)
for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];
where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups which occur in a random order, and pout is a pointer to a double array out of size ng which is initialized with zeros and contains the result of summing x over the grouping defined by g.
The code above works, but the performance is not optimal, so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays, and the number of threads nth are available to me and fixed beforehand. I cannot directly access the arrays, only the pointers are passed to a function which does the parallel sum.
Your code has a data race (at line pout[pg[i]] += ...); you should fix it first, then worry about its performance.
If ng is not too big and you use OpenMP 4.5+, the most efficient solution is using a reduction: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
If ng is too big, most probably the best idea is to use a serial version of the program on PCs. Note that your code will be correct if you add #pragma omp atomic before pout[pg[i]] += ..., but its performance is questionable.
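For reference, a sketch of the reduction variant wrapped in a function, assuming an OpenMP 4.5+ compiler. The function name grouped_sum is illustrative; the parameter names follow the question:

```c
#include <assert.h>

/* Sums px[i] into pout[pg[i]]. Each thread gets a private, zeroed copy
   of pout[0..ng), and OpenMP adds the copies into pout at the end. */
void grouped_sum(double *pout, const int *pg, const double *px,
                 int l, int ng, int nth)
{
    assert(nth > 0 && ng > 0);
    #pragma omp parallel for num_threads(nth) reduction(+ : pout[:ng])
    for (int i = 0; i < l; ++i) {
        pout[pg[i]] += px[i];
    }
}
```

The private copies cost ng doubles per thread, which is why this only pays off while ng stays moderate.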
From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control with critical sections or locks will probably only slow down the code.
Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.
Here is the crucial bit of code:
#pragma omp parallel shared(sum,threadsum)
{
int thread = omp_get_thread_num(),
myfirst = thread*ngroups;
#pragma omp for
for ( int i=0; i<biglen; i++ )
threadsum[ myfirst+indexes[i] ] += 1;
#pragma omp for
for ( int igrp=0; igrp<ngroups; igrp++ )
for ( int t=0; t<nthreads; t++ )
sum[igrp] += threadsum[ t*ngroups+igrp ];
}
Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.
Final word: I also coded @Laci's solution of just using a reduction. Testing on 1M groups output: For 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2 because the reduction solution repeatedly adds the whole array while I sum it just once, and then in parallel. For smaller numbers of groups the reduction is probably preferred overall.

Efficient Parallel algorithm for array filtering

Given a very large array I want to select only the elements that match some condition. I know a priori the number of elements that will be matched. My current pseudocode is:
filter(list):
out = list of predetermined size
i = 0
for element in list
if element matches condition
out[i++] = element
return out
When trying to parallelize the previous algorithm, my naive approach was to just make the increment of i atomic (It's part of an OpenMP project so used #pragma omp atomic). However this implementation has slowed performance when compared to even the serial implementation. What more efficient algorithms are there to implement this? And why am I getting such heavy slowdown?
Information from the comments:
I'm using C with OpenMP;
Currently testing with one million entries;
Serial takes ~7 seconds, Parallel with two threads takes ~15;
The output array size is exactly half (the problem is equivalent to finding the elements of the array smaller than the median of said array);
I'm testing it with two cores only.
However this implementation has slowed performance when compared to
even the serial implementation. What more efficient algorithms are
there to implement this?
The bottleneck is the overhead of the atomic operation, so at first glance a more efficient algorithm would be one that avoids it. That is possible, but the code would require a two-step approach: first, each thread saves the elements that it has found in a private array. After the parallel region, the main thread collects all the elements that each thread has found and merges them into a single array.
The second part of merging into a single array can be made parallel as well or using SIMD instructions.
I have created a small code to mimic your pseudocode:
#include <stdio.h>   // printf
#include <stdlib.h>  // malloc, free
#include <omp.h>     // omp_get_wtime
int main ()
{
int array_size = 1000000;
int *a = malloc(sizeof(int) * array_size);
int *out = malloc(sizeof(int) * array_size/2);
for(int i = 0; i < array_size; i++)
a[i] = i;
double start = omp_get_wtime();
int i = 0;
#pragma omp parallel for
for (int n=0 ; n < array_size; ++n ){
if(a[n] % 2 == 0){
int tmp;
#pragma omp atomic capture
tmp = i++;
out[tmp] = a[n];
}
}
double end = omp_get_wtime();
printf("%f\n",end-start);
free(a);
free(out);
return 0;
}
I did an ad hoc benchmark with the following results:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.017891 (s)
-> 4 Threads : 0.015254 (s)
So the parallel version is much, much slower, which is to be expected because the work performed in parallel is simply not enough to overcome the overhead of the atomic and of the parallelism.
I have tested also removing the atomic and leaving the race-condition just to check the time:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.001283 (s)
-> 4 Threads : 0.000720 (s)
So without the atomic the speedup was approximately 1.2 and 2.2 for 2 and 4 threads, respectively. So naturally the atomic is causing a huge overhead. Nonetheless, the speedups are not great even without any atomic. And this is the most that you can expect from parallelizing your code alone.
Depending on your real code, and on how computationally demanding your condition is, you may not achieve great speedups even with the second approach that I have mentioned.
A useful note from the comments from @Paul G:
Even if it isn't going to speed up this particular example, it might
be useful to state that problems like this are generally solved in
parallel computing using a (parallel) exclusive scan algorithm/prefix
sum to determine which thread starts at which index of the output
array. Then every thread has a private output index it is incrementing
and one doesn't need atomics.
That approach might be slower than @dreamcrash's solution in this
particular case but is more general, as having private arrays for each
thread can be very tricky (determining their size etc.), especially if
your output is bigger than the input (the case where each input element
doesn't give 0 or 1 output elements, but n output elements).
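The scan-based approach from that comment can be sketched as follows. This is an illustration under assumptions, not the commenter's code: the input is split into a fixed number of chunks rather than per thread, the exclusive scan over the chunk counts is done serially (it is tiny compared to the data), and matches, filter_scan and nchunks are hypothetical names:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical predicate standing in for "element matches condition". */
static int matches(int v) { return v % 2 == 0; }

/* Copies the matching elements of in[0..n) to out, preserving order.
   Returns the number of elements written, or (size_t)-1 if the scratch
   allocation fails. */
size_t filter_scan(const int *in, size_t n, int *out, size_t nchunks)
{
    assert(nchunks > 0);
    size_t *offset = calloc(nchunks + 1, sizeof *offset);
    if (offset == NULL) return (size_t)-1;
    size_t chunk = (n + nchunks - 1) / nchunks;

    /* Pass 1: count the matches in every chunk independently. */
    #pragma omp parallel for
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = c * chunk;
        size_t hi = lo + chunk < n ? lo + chunk : n;
        size_t cnt = 0;
        for (size_t k = lo; k < hi; k++) cnt += matches(in[k]) != 0;
        offset[c + 1] = cnt;
    }

    /* Exclusive scan: offset[c] becomes the first output index of chunk c. */
    for (size_t c = 0; c < nchunks; c++) offset[c + 1] += offset[c];

    /* Pass 2: every chunk writes to its own output range, no atomics. */
    #pragma omp parallel for
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = c * chunk;
        size_t hi = lo + chunk < n ? lo + chunk : n;
        size_t pos = offset[c];
        for (size_t k = lo; k < hi; k++)
            if (matches(in[k])) out[pos++] = in[k];
    }

    size_t total = offset[nchunks];
    free(offset);
    return total;
}
```

The condition is evaluated twice per element, which is the price for not needing atomics or per-thread buffers of unknown size.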

Convert sequential loop into parallel in C using pthreads

I would like to apply a pretty simple straightforward calculation on a n-by-d-dimensional array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my script? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
double * calcDistance(double * X ,int n, int d)
{
//calculate and return an array[n-1] of all the distances
//from the last point
double *distances = calloc(n,sizeof(double));
for(int i=0 ; i<n-1; i++)
{
//distances[i]=0;
for (int j=0; j< d; j++)
{
distances[i] += pow(X[(j+1)*n-1]-X[j*n+i], 2);
}
distances[i] = sqrt(distances[i]);
}
return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>   // pow, sqrt used by calcDistance
#include <time.h>   // time, used to seed rand
#define N 10 //00000
#define D 2
int main()
{
srand(time(NULL));
//allocate the proper space for X
double *X = malloc(D*N*(sizeof(double)));
//fill X with numbers in space (0,1)
for(int i = 0 ; i<N ; i++)
{
for(int j=0; j<D; j++)
{
X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
}
}
// keep X and the result separate so both can be freed
double *distances = calcDistance(X, N, D);
free(distances);
free(X);
return 0;
}
I have already tried utilizing pthreads asynchronously through the use of a global_index that is imposed to mutex and a local_index. Through the use of a while() loop, a local_index is assigned to each thread on each iteration. The local_index assignment depends on the global_index value at that time (both happening in a mutual exclusion block). The thread executes the computation on the distances[local_index] element.
Unfortunately this implementation has led to a much slower program, with an execution time 10 to 20 times longer than that of the sequential one cited above.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
Your inner loop jumps all over array X with a mixture of strides that varies with the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc.
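As a sketch of that static split, the helper below computes the half-open iteration range for thread t out of nthreads. The name chunk_range is hypothetical, and spreading the remainder over the first threads is one possible choice for the case where N is not a multiple of T:

```c
#include <assert.h>

/* Assigns thread t the iterations [*begin, *end) out of n in total.
   The first (n % nthreads) threads get one extra iteration each. */
void chunk_range(int n, int nthreads, int t, int *begin, int *end)
{
    assert(n >= 0 && nthreads > 0 && t >= 0 && t < nthreads);
    int base = n / nthreads;
    int extra = n % nthreads;
    *begin = t * base + (t < extra ? t : extra);
    *end = *begin + base + (t < extra ? 1 : 0);
}
```

Each pthread would receive its index t in its argument struct, call the helper once, and then run the outer loop over its own range, writing only to its own distances[i] entries, so no mutex is involved.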
How could I significantly reduce the execution time of my script?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
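A small sketch of the multiplication-instead-of-pow change, using the column-major layout from the question (coordinate j of point i is X[j*n + i]). The function name sq_dist_to_last is illustrative; the caller would still apply sqrt once per point:

```c
#include <assert.h>
#include <stddef.h>

/* Returns the squared Euclidean distance from point i to the last
   point, squaring each difference with a plain multiplication. */
double sq_dist_to_last(const double *X, int n, int d, int i)
{
    assert(X != NULL && n > 0 && i >= 0 && i < n);
    double sum = 0.0;
    for (int j = 0; j < d; j++) {
        double diff = X[(size_t)j * n + (n - 1)] - X[(size_t)j * n + i];
        sum += diff * diff;
    }
    return sum;
}
```

Accumulating into a local variable also avoids the repeated read and write of distances[i] inside the inner loop.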
I have already tried utilizing pthreads asynchronously through the use
of a global_index that is imposed to mutex and a local_index. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
This sounds like what I described. It is one of the standard scheduling models provided by OMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you parallelize across five threads on a four-core machine, then one thread will have to wait to run until one of the others has finished, so the run takes two rounds of one fifth of the work each: a best theoretical reduction in run time of 60%. Parallelizing the same computation across only four threads uses the compute resources more efficiently, for a best theoretical reduction in run time of about 75%.
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.

Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

Seemed to have fixed it myself by type casting the cij2 pointer inside the mm256 call
so _mm256_storeu_pd((double *)cij2,vecC);
I have no idea why this changed anything...
I'm writing some code and trying to take advantage of the Intel manual vectorization. But whenever I run the code I get a segmentation fault on trying to use my double *cij2.
if( q == 0)
{
__m256d vecA;
__m256d vecB;
__m256d vecC;
for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
{
double cij = C[i+j*lda];
double *cij2 = (double *)malloc(4*sizeof(double));
for (int k = 0; k < K; k+=4)
{
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
_mm256_storeu_pd(cij2, vecC);
for (int x = 0; x < 4; x++)
{
cij += cij2[x];
}
}
C[i+j*lda] = cij;
}
I've pinpointed the problem to the cij2 pointer. If I comment out the 2 lines that include that pointer the code runs fine; it doesn't work like it should, but it'll actually run.
My question is: why would I get a segmentation fault here? I know I've allocated the memory correctly, and that the memory holds a 256-bit vector of four 64-bit doubles.
After reading the comments I've come to add some clarification.
First thing I did was change the _mm_malloc to just a normal allocation using malloc. Shouldn't affect either way but will give me some more breathing room theoretically.
Second the problem isn't coming from a null return on the allocation, I added a couple loops in to increment through the array and make sure I could modify the memory without it crashing so I'm relatively sure that isn't the problem. The problem seems to stem from the loading of the data from vecC to the array.
Lastly I can not use BLAS calls. This is for a parallelisms class. I know it would be much simpler to call on something way smarter than I but unfortunately I'll get a 0 if I try that.
You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.
In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.
Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.
When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).
Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.
__m256d cij_vec = _mm256_setzero_pd();
for (int k = 0; k < K; k+=4) {
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
cij_vec = _mm256_add_pd(cij_vec, vecC); // TODO: use multiple accumulators to keep multiple VADDPD or VFMADD...PD instructions in flight.
}
C[i+j*lda] = hsum256_pd(cij_vec); // put the horizontal sum in an inline function
For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
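The multiple-accumulator idea is independent of AVX, so here is a scalar C stand-in rather than intrinsics: four independent partial sums break the single dependency chain, so consecutive additions do not have to wait for each other. The name dot4 and the choice of four accumulators are illustrative assumptions; the right count depends on the latency and throughput figures above:

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent accumulators. */
double dot4(const double *a, const double *b, size_t n)
{
    assert(a != NULL && b != NULL);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        s0 += a[k]     * b[k];
        s1 += a[k + 1] * b[k + 1];
        s2 += a[k + 2] * b[k + 2];
        s3 += a[k + 3] * b[k + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; k < n; k++) s0 += a[k] * b[k];
    /* Combine the partial sums once, outside the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

The same pattern applies with __m256d accumulators: one _mm256_add_pd per accumulator inside the loop, then add the vectors together and do a single horizontal sum at the end.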
Also, ideally nest your loops in a way that gave you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse). Even if you don't get fancy and transpose small blocks at a time that fit in cache. Even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].

Resources