OpenMP parallelize grouped array sum using pointers - c

I want to effectively parallelize the following sum in C:
#pragma omp parallel for num_threads(nth)
for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];
where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups which occur in a random order, and pout is a pointer to a double array out of size ng which is initialized with zeros and contains the result of summing x over the grouping defined by g.
The code above works, but the performance is not optimal, so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays and the number of threads nth are available to me and fixed beforehand. I cannot access the arrays directly; only the pointers are passed to the function that does the parallel sum.

Your code has a data race (at pout[pg[i]] += ...); you should fix that first, then worry about performance.
If ng is not too big and you use OpenMP 4.5+, the most efficient solution is an array-section reduction: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
If ng is too big, the best option is most probably the serial version of the program on a PC. Note that your code becomes correct by adding #pragma omp atomic before pout[pg[i]] += ..., but its performance is questionable.
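For illustration, here is a minimal sketch of how that reduction could look inside the function that receives the pointers; the function name and signature are assumed, not taken from the question.
// Sketch: array-section reduction (OpenMP 4.5+). Each thread gets a
// zero-initialized private copy of pout[0:ng]; the copies are added into
// the original array at the end of the loop.
void group_sum(double *pout, const double *px, const int *pg,
               int l, int ng, int nth)
{
    #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
    for (int i = 0; i < l; ++i)
        pout[pg[i]] += px[i];
}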

From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control this with critical sections or locks will probably only slow down the code.
Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.
Here is the crucial bit of code:
#pragma omp parallel shared(sum,threadsum)
{
  int thread  = omp_get_thread_num(),
      myfirst = thread*ngroups;   // start of this thread's private copy
  // Each thread accumulates into its own slice of threadsum.
  #pragma omp for
  for ( int i=0; i<biglen; i++ )
    threadsum[ myfirst+indexes[i] ] += 1;
  // Combine the per-thread copies; the groups are divided over the threads.
  #pragma omp for
  for ( int igrp=0; igrp<ngroups; igrp++ )
    for ( int t=0; t<nthreads; t++ )
      sum[igrp] += threadsum[ t*ngroups+igrp ];
}
Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even though I've eliminated things like false sharing, I get pathetic or no speedup. This is not clear to me yet.
Final word: I also coded #Laci's solution of just using a reduction. Testing with 1M groups in the output: for 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2, because the reduction solution repeatedly adds the whole array while I sum it just once, and then in parallel. For smaller numbers of groups the reduction is probably preferred overall.
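For reference, a rough sketch of the per-thread-copy approach written directly in terms of the question's pointers (pout, px, pg) and sizes (l, ng, nth); the function name and the contiguous layout of the copies are my own choices.
#include <omp.h>
#include <stdlib.h>

void group_sum_copies(double *pout, const double *px, const int *pg,
                      int l, int ng, int nth)
{
    // One private copy of the output per thread, laid out contiguously.
    double *threadsum = calloc((size_t)nth * ng, sizeof *threadsum);

    #pragma omp parallel num_threads(nth)
    {
        double *mysum = threadsum + (size_t)omp_get_thread_num() * ng;

        // Each thread accumulates its share of the data into its own copy.
        #pragma omp for
        for (int i = 0; i < l; i++)
            mysum[pg[i]] += px[i];

        // Merge the copies; the groups are divided between the threads.
        #pragma omp for
        for (int g = 0; g < ng; g++)
            for (int t = 0; t < nth; t++)
                pout[g] += threadsum[(size_t)t * ng + g];
    }
    free(threadsum);
}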

Related

Efficient Parallel algorithm for array filtering

Given a very large array, I want to select only the elements that match some condition. I know a priori the number of elements that will be matched. My current pseudocode is:
filter(list):
    out = list of predetermined size
    i = 0
    for element in list
        if element matches condition
            out[i++] = element
    return out
When trying to parallelize the previous algorithm, my naive approach was to just make the increment of i atomic (it's part of an OpenMP project, so I used #pragma omp atomic). However, this implementation is slower than even the serial implementation. What more efficient algorithms are there to implement this? And why am I getting such a heavy slowdown?
Information from the comments:
I'm using C with OpenMP;
Currently testing with one million entries;
Serial takes ~7 seconds, Parallel with two threads takes ~15;
The output array size is exactly half (the problem is equivalent to finding the elements of the array smaller than the median of said array);
I'm testing it with two cores only.
However this implementation has slowed performance when compared to
even the serial implementation. What more efficient algorithms are
there to implement this?
The bottleneck is the overhead of the atomic operation, so at first glance a more efficient algorithm would be one that avoids it. That is possible, but the code requires a two-step approach: for instance, each thread saves the elements that it has found into a private array, and after the parallel region the main thread collects the elements that each thread found and merges them into a single array.
The second part, merging into a single array, can also be done in parallel or with SIMD instructions.
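A rough sketch of that two-step idea, in a variant where each thread reserves a slice of the shared output and copies its own private buffer into it (the names and the predicate are illustrative):
#include <omp.h>
#include <stdlib.h>
#include <string.h>

// Returns the number of matching elements written to out.
int filter_parallel(const int *in, int n, int *out)
{
    int total = 0;
    #pragma omp parallel
    {
        // Private buffer; worst case, every element matches.
        int *local = malloc(sizeof(int) * n);
        int count = 0;

        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            if (in[i] % 2 == 0)          // the condition from the benchmark
                local[count++] = in[i];

        // Reserve a disjoint slice of the output, then copy into it.
        int offset;
        #pragma omp critical
        {
            offset = total;
            total += count;
        }
        memcpy(out + offset, local, sizeof(int) * count);
        free(local);
    }
    return total;
}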
I have created a small program to mimic your pseudocode:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>

int main ()
{
    int array_size = 1000000;
    int *a = malloc(sizeof(int) * array_size);
    int *out = malloc(sizeof(int) * array_size/2);
    for(int i = 0; i < array_size; i++)
        a[i] = i;

    double start = omp_get_wtime();
    int i = 0;
    #pragma omp parallel for
    for (int n = 0; n < array_size; ++n){
        if(a[n] % 2 == 0){
            int tmp;
            // Atomically reserve the next free slot in the output array.
            #pragma omp atomic capture
            tmp = i++;
            out[tmp] = a[n];
        }
    }
    double end = omp_get_wtime();
    printf("%f\n", end - start);
    free(a);
    free(out);
    return 0;
}
I did an ad hoc benchmark with the following results:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.017891 (s)
-> 4 Threads : 0.015254 (s)
So the parallel version is much, much slower, which is to be expected because the work performed in parallel is simply not enough to overcome the overhead of the atomic and of the parallelism itself.
I also tested removing the atomic and leaving the race condition in, just to check the timing:
-> Sequential : 0.001597 (s)
-> 2 Threads : 0.001283 (s)
-> 4 Threads : 0.000720 (s)
So without the atomic the speedup was approximately 1.2 and 2.2 for 2 and 4 threads, respectively. Naturally, the atomic causes a huge overhead. Nonetheless, the speedups are not great even without any atomic, and this is the most you can expect from parallelizing this code alone.
Depending on your real code, and on how computationally demanding your condition is, you may not achieve great speedups even with the second approach that I mentioned.
A useful note from the comments by #Paul G:
Even if it isn't going to speed up this particular example, it might
be useful to state that problems like this are generally solved in
parallel computing using a (parallel) exclusive scan algorithm/prefix
sum to determine which thread starts at which index of the output
array. Then every thread has a private output index it is incrementing
and one doesn't need atomics.
That approach might be slower than #dreamcrash's solution in this
particular case, but it is more general, since having private arrays for each
thread can be very tricky (determining their size, etc.), especially if
your output is bigger than the input (the case where each input element
doesn't give 0 or 1 output elements, but n output elements).
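A sketch of that counting plus exclusive-scan approach (names are illustrative; the predicate matches the benchmark above):
#include <omp.h>
#include <stdlib.h>

// Returns the number of matching elements written to out.
int filter_scan(const int *in, int n, int *out)
{
    int nthreads = omp_get_max_threads();
    int *counts = calloc(nthreads + 1, sizeof *counts);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        // Pass 1: each thread counts how many of its elements match.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            if (in[i] % 2 == 0)
                counts[tid + 1]++;

        // Exclusive scan of the per-thread counts (serial; nthreads is small).
        #pragma omp single
        for (int t = 0; t < nthreads; t++)
            counts[t + 1] += counts[t];

        // Pass 2: the same static schedule gives each thread the same
        // iterations as in pass 1, so it writes a disjoint output range.
        int pos = counts[tid];
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            if (in[i] % 2 == 0)
                out[pos++] = in[i];
    }
    int total = counts[nthreads];
    free(counts);
    return total;
}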

Matrix Multiplication using OpenMP (C) - Collapsing all the loops

So I was learning about the basics of OpenMP in C and its work-sharing constructs, particularly the for loop. One of the most famous examples used in all the tutorials is matrix multiplication, but all of them just parallelize the outer loop or the two outer loops. I was wondering why we do not parallelize and collapse all 3 loops (using atomic), as I have done here:
for(int i=0; i<100; i++){
    //Initialize the arrays
    for(int j=0; j<100; j++){
        A[i][j] = i;
        B[i][j] = j;
        C[i][j] = 0;
    }
}
//Starting the matrix multiplication
#pragma omp parallel num_threads(4)
{
    #pragma omp for collapse(3)
    for(int i=0; i<100; i++){
        for(int j=0; j<100; j++){
            for(int k=0; k<100; k++){
                #pragma omp atomic
                C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
            }
        }
    }
}
Can you tell me what I am missing here, or why this is an inferior/superior solution?
Atomic operations are very costly on most architectures compared to non-atomic ones (see here to understand why, or here for a more detailed analysis). This is especially true when many threads make concurrent accesses to the same shared memory area. To put it simply, one cause is that, on most hardware, threads performing atomic operations cannot fully run in parallel without waiting for each other, due to the implicit synchronizations and communications coming from the cache coherence protocol. Another source of slowdown is the high latency of atomic operations (again due to the cache hierarchy).
If you want to write code that scales well, you need to minimize synchronization and communication (including atomic operations).
As a result, using collapse(2) is much better than collapse(3). But this is not the only issue in your code. Indeed, to be efficient you must access memory contiguously and keep data in caches as much as possible.
For example, swapping the loop iterating over j and the one iterating over k (which does not work with collapse(2)) is several times faster on most systems due to more contiguous memory accesses (about 8 times faster on my PC):
for(int i=0; i<100; i++){
    //Initialize the arrays
    for(int j=0; j<100; j++){
        A[i][j] = i;
        B[i][j] = j;
        C[i][j] = 0;
    }
}
//Starting the matrix multiplication
#pragma omp parallel num_threads(4)
{
    #pragma omp for
    for(int i=0; i<100; i++){
        for(int k=0; k<100; k++){
            for(int j=0; j<100; j++){
                C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
            }
        }
    }
}
Writing fast matrix-multiplication code is not easy. Consider using BLAS libraries such as OpenBLAS, ATLAS, Eigen or Intel MKL rather than writing your own code if your goal is to use this in production code. Indeed, such libraries are very optimized and often scale well on many cores.
If your goal is to understand how to write efficient matrix-multiplication codes, a good starting point may be to read this tutorial.
Collapsing loops requires that you know what you are doing as it may result in very cache-unfriendly splits of the iteration space or introduce data dependencies depending on how the product of the loop counts relates to the number of threads.
Imagine the following constructed example, which is not that uncommon actually (the loop counts are small just to illustrate the point):
for (int i = 0; i < 7; i++)
    for (int j = 0; j < 3; j++)
        a[i] += b[i][j];
If you parallelise the outer loop, three threads get two iterations and one thread gets just one, but all of them do all the iterations of the inner loop:
---0-- ---1-- ---2-- -3- (thread number)
000111 222333 444555 666 (values of i)
012012 012012 012012 012 (values of j)
Each a[i] gets processed by one thread only. Smart compilers may implement the inner loop using register optimisation, accumulating the values in a register first and only assigning to a[i] at the very end, and it will run very fast.
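For illustration, that register optimisation corresponds roughly to the following manual rewrite of the fragment above (the array types are placeholders, as in the fragment):
// Accumulate in a local variable and write a[i] only once per outer iteration.
for (int i = 0; i < 7; i++) {
    double acc = a[i];
    for (int j = 0; j < 3; j++)
        acc += b[i][j];
    a[i] = acc;
}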
If you collapse the two loops, you end up in a very different situation. Since there is a total of 7x3 = 21 iterations now, the default split will be (depending on the compiler and the OpenMP runtime, but most of them do this) five iterations for three of the threads and six iterations for the remaining one:
--0-- --1-- --2-- ---3-- (thread number)
00011 12223 33444 555666 (values of i)
01201 20120 12012 012012 (values of j)
As you can see, now a[1] is processed by both thread 0 and thread 1. Similarly, a[3] is processed by both thread 1 and thread 2. And there you have it - you just introduced a data dependency that wasn't there in the previous case, so now you have to use atomic in order to prevent data races. The price that you pay for synchronisation is way higher than doing one iteration more or less! In your case, if you only collapse the two outer loops, you won't need to use atomic at all (although, in your particular case, 4 divides 100, and even when collapsing all the loops together you don't need the atomic construct, but you need it in the general case).
Another issue is that after collapsing the loops, there is a single loop index and both i and j indices have to be reconstructed from this new index using division and modulo operations. For simple loop bodies like yours, the overhead of reconstructing the indices may be simply too high.
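For reference, collapse(2) on the 7x3 example is conceptually a single loop over 21 iterations, with i and j recovered arithmetically on every iteration:
// Conceptual equivalent of collapse(2): one flat index n, with i and j
// reconstructed by division and modulo in each iteration.
for (int n = 0; n < 7 * 3; n++) {
    int i = n / 3;
    int j = n % 3;
    a[i] += b[i][j];
}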
There are very few good reasons not to use a library for matrix-matrix multiplication, so as suggested already, please call BLAS instead of writing this yourself. Nonetheless, the questions you ask are not specific to matrix-matrix multiplication, so they deserve to be answered anyways.
There are a few things that can be improved here:
Use contiguous memory.
If K is the innermost loop, you are doing dot-products, which are harder to vectorize. The loop order IKJ will vectorize better, for example.
If you want to parallelize a dot product with OpenMP, use a reduction instead of many atomics.
I have illustrated each of these techniques independently below.
Contiguous memory
int n = 100;
double * C = malloc(n*n*sizeof(double));
for(int i=0; i<n; i++){
    for(int j=0; j<n; j++){
        C[i*n+j] = 0.0;
    }
}
IKJ loop ordering
for(int i=0; i<100; i++){
    for(int k=0; k<100; k++){
        for(int j=0; j<100; j++){
            C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
        }
    }
}
Parallel dot-product
double x = 0;
#pragma omp parallel for reduction(+:x)
for(int k=0; k<100; k++){
    x += (A[i][k]*B[k][j]);
}
C[i][j] += x;
External resources
How to Write Fast Numerical Code: A Small Introduction covers these topics in far more detail.
BLISlab is an excellent tutorial specific to matrix-matrix multiplication that will teach you how the experts write a BLAS library call.

Convert sequential loop into parallel in C using pthreads

I would like to apply a pretty simple, straightforward calculation to an n-by-d array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my script? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
double * calcDistance(double * X, int n, int d)
{
    //calculate and return an array[n-1] of all the distances
    //from the last point
    double *distances = calloc(n, sizeof(double));
    for(int i=0; i<n-1; i++)
    {
        //distances[i]=0;
        for (int j=0; j<d; j++)
        {
            distances[i] += pow(X[(j+1)*n-1] - X[j*n+i], 2);
        }
        distances[i] = sqrt(distances[i]);
    }
    return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N 10 //00000
#define D 2

int main()
{
    srand(time(NULL));
    //allocate the proper space for X
    double *X = malloc(D*N*(sizeof(double)));
    //fill X with numbers in space (0,1)
    for(int i = 0; i < N; i++)
    {
        for(int j = 0; j < D; j++)
        {
            X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
        }
    }
    double *distances = calcDistance(X, N, D);
    free(X);
    free(distances);
    return 0;
}
I have already tried utilizing pthreads asynchronously through the use of a global_index protected by a mutex and a local_index. Through the use of a while() loop, a local_index is assigned to each thread on each iteration. The local_index assignment depends on the global_index value at that time (both happen in a mutual exclusion block). The thread then executes the computation on the distances[local_index] element.
Unfortunately this implementation has led to a much slower program, with a 10x or 20x larger execution time compared to the sequential one cited above.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
Your inner loop jumps all over array X with a mixture of strides that varies with
the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc..
How could I significantly reduce the execution time of my script?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
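For instance, the pow call in the inner loop can be replaced by a plain multiplication; a sketch using the same indexing as calcDistance above:
// Squaring via multiplication instead of pow(..., 2):
double diff = X[(j+1)*n - 1] - X[j*n + i];
distances[i] += diff * diff;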
I have already tried utilizing pthreads asynchronously through the use
of a global_index that is imposed to mutex and a local_index. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
This sounds like what I described. It is one of the standard scheduling models provided by OpenMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you split the work across five threads on a four-core machine, then one thread will have to wait to run until one of the others has finished -- a best-case reduction in run time of 60%. Splitting the same computation across only four threads uses the compute resources more efficiently, for a best-case reduction of about 75%.
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.
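For completeness, here is a minimal sketch of the static split described above, applied to the question's calcDistance kernel with pthreads; the thread count T, the task struct, and the helper names are mine, not from the question.
#include <math.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    const double *X;
    double *distances;
    int n, d, begin, end;   // this thread handles rows [begin, end)
} task_t;

static void *distance_worker(void *arg)
{
    task_t *t = arg;
    for (int i = t->begin; i < t->end; i++) {
        double sum = 0.0;
        for (int j = 0; j < t->d; j++) {
            double diff = t->X[(j+1)*t->n - 1] - t->X[j*t->n + i];
            sum += diff * diff;
        }
        t->distances[i] = sqrt(sum);
    }
    return NULL;
}

double *calcDistanceParallel(const double *X, int n, int d, int T)
{
    double *distances = calloc(n, sizeof(double));
    pthread_t threads[T];
    task_t tasks[T];

    // Statically split the n-1 outer-loop iterations into T contiguous chunks.
    for (int t = 0; t < T; t++) {
        tasks[t] = (task_t){ X, distances, n, d,
                             t * (n - 1) / T, (t + 1) * (n - 1) / T };
        pthread_create(&threads[t], NULL, distance_worker, &tasks[t]);
    }
    for (int t = 0; t < T; t++)
        pthread_join(threads[t], NULL);
    return distances;
}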

OpenMP to write concurrently in the same block of memory

Assume we have two 3D arrays, A and B, each with a different number of elements. I do some operations with values of A that correspond to some values of B with different indices.
For example: I use A[i][j][k] to calculate some quantities. Since each calculation is independent, I can do this using parallel for with no problem. But the updated values are used to increase the values at some positions of the B array.
For example:
A[i][j][k] -> c (a number) -> B[l][m][n]. But at the same time a lot of writes could occur to the same B[l][m][n]. I use B[l][m][n] += c to update the B elements. Of course I cannot use OpenMP naively here because I violate the independence of the loop iterations, and I do not know the indices l,m,n a priori in order to group them into buffered writes. Any good ideas on how to implement this in parallel? Maybe critical or atomic would benefit me here, but I don't know how.
A simplified version of my problem (1D)
for(int i=0; i<size_A; i++)
{
    //Some code here to update A[i]. Each update is independent.
}
for(int j=0; j<size_A; j++)
{
    //From A[j], B[m] and B[m+1] are evaluated
    int m = A[j]/dx;
    double c; //Make some calculations
    B[m] += c;
    B[m+1] += c*c;
}
There are two ways to do this. Either you make each update atomic. This just looks like:
#pragma omp atomic update
B[m] += c;
Or each thread gets a private copy of B, and after the loop the private copies are all safely added together. This is possible with a reduction:
#pragma omp parallel for reduction(+:B)
for (...)
The latter requires OpenMP 4.5 (and if B is a pointer rather than a true array, you need the array-section form reduction(+:B[:size_B])), but there are workarounds for earlier versions.
Which one is better for you depends on the rate of updates and size of the matrix. You have to measure both to be sure.
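For OpenMP versions before 4.5, one such workaround is a hand-rolled reduction with per-thread copies. A rough sketch, reusing the names from the simplified 1D example above (the "make some calculations" step is a placeholder):
#include <stdlib.h>

void update_B(const double *A, int size_A, double *B, int size_B, double dx)
{
    #pragma omp parallel
    {
        // Each thread accumulates into its own zero-initialized copy of B.
        double *B_local = calloc(size_B, sizeof(double));

        #pragma omp for nowait
        for (int j = 0; j < size_A; j++) {
            int m = (int)(A[j] / dx);
            double c = A[j];             // placeholder calculation
            B_local[m]     += c;
            B_local[m + 1] += c * c;
        }

        // Merge the private copy into the shared B, one thread at a time.
        #pragma omp critical
        for (int i = 0; i < size_B; i++)
            B[i] += B_local[i];

        free(B_local);
    }
}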

Parallelizing nested loop in OpenMP using #pragma parallel for shared

I'm trying to parallelize a piece of code. It looks like this:
#pragma omp parallel private(i,j,k)
#pragma omp parallel for shared(A)
for(k=0; k<100; k++)
    for(i=1; i<1024; i++)
        for(j=0; j<1024; j++)
            A[i][j+1] = << some expression involving elements of A[i-1][j-1] >>
On executing this code I'm getting a different result from serial execution of the loops.
I'm unable to understand what I'm doing wrong.
I've also tried the collapse()
#pragma omp parallel private(i,j,k)
#pragma omp parallel for collapse(3) shared(A)
for(k=0; k<100; k++)
    for(i=1; i<1024; i++)
        for(j=0; j<1024; j++)
            A[i][j+1] = << some expression involving elements of A[][] >>
Another thing I tried was having a #pragma omp parallel for before each loop instead of collapse().
The issue, I think, is the data dependency. Any idea how to parallelize in the presence of this data dependency?
If this is really your use case, just parallelizing the outer loop, k, should largely suffice for the modest parallelism that you have on common architectures.
If you want more, you'd have to rewrite your loops such that you have an inner part that doesn't have the dependency. In your example case this is relatively easy: you'd have to process by "diagonals" (outer loop, sequential), and then inside each diagonal the iterations are independent.
for (size_t d=0; d<nDiag(100); ++d) {
    size_t nPoints = somefunction(d);
    #pragma omp parallel for
    for (size_t p=0; p<nPoints; ++p) {
        size_t x = coX(p, d);
        size_t y = coY(p, d);
        ... your real code ...
    }
}
Part of this could be done automatically, but I don't think such tools are readily available in everyday OpenMP implementations. This is an active line of research.
Also note the following
int is rarely a good idea for indices, in particular if you access matrices. If you have to compute the absolute position of an entry yourself (and you can see here that you might), this overflows easily. int is usually 32 bits wide, and of these 32 you are even wasting one for the sign. In C, object sizes are measured with size_t, which is most of the time 64 bits wide and in any case the correct type chosen by your platform designer.
Use local variables for loop indices and other temporaries; as you can see, writing OpenMP pragmas becomes much easier then. Locality is one key to parallelism. Help yourself and the compiler by expressing this correctly.
You're only parallelizing the outer 'k' for loop. Every parallel thread is executing the 'i' and 'j' loops, and they're all writing into the same 'A' result. Since they're all reading and writing the same slots in A, the final result will be non-deterministic.
It's not clear from your problem that any parallelism is possible, since each step seems to depend on every previous step.
