I have an OpenMP program that calculates variables in a loop.
int a[1000000];
int b[1000000];
int c[1000000];
int d[1000000];
int i;
#pragma omp parallel for private(i) shared(a,b,c,d)
for (i = 0; i < 1000000; ++i)
{
d[i] = b[i]*a[i] + c[i]+10;
}
I profiled it with perf, which reported that the bottleneck is memory reads and writes.
First question: Is it possible to split the arrays A,B,C,D and put them into different memory banks using OpenMP?
Second question: If I split the arrays A,B,C,D into smaller arrays, will they speed up the execution of the loop?
First question: Is it possible to split the arrays A,B,C,D and put them into different memory banks using OpenMP?
To my knowledge, no: you cannot split arrays across memory banks explicitly with OpenMP runtime functions. What OpenMP does is distribute an equal share of the work to each thread. Say you have set OMP_NUM_THREADS=10; then each thread will receive 100000 iterations to execute (or the last thread will receive fewer, if the iteration count does not divide evenly). What you can do is pin threads to cores by setting thread affinity.
Second question: If I split the arrays A,B,C,D into smaller arrays, will they speed up the execution of the loop?
No. But you can increase the number of threads to obtain better performance, up to the point where the threads' overhead outweighs the speedup.
Related
I want to effectively parallelize the following sum in C:
#pragma omp parallel for num_threads(nth)
for(int i = 0; i < l; ++i) pout[pg[i]] += px[i];
where px is a pointer to a double array x of size l containing some data, pg is a pointer to an integer array g of size l that assigns each data point in x to one of ng groups which occur in a random order, and pout is a pointer to a double array out of size ng which is initialized with zeros and contains the result of summing x over the grouping defined by g.
The code above works, but the performance is not optimal, so I wonder if there is something I can do in OpenMP (such as a reduction() clause) to improve the execution. The dimensions l and ng of the arrays and the number of threads nth are available to me and fixed beforehand. I cannot access the arrays directly; only the pointers are passed to the function that does the parallel sum.
Your code has a data race (at pout[pg[i]] += ...); you should fix that first, then worry about performance.
If ng is not too big and you use OpenMP 4.5+, the most efficient solution is a reduction: #pragma omp parallel for num_threads(nth) reduction(+:pout[:ng])
If ng is too big, the best option is most likely to run this part serially. Note that your code becomes correct if you add #pragma omp atomic before pout[pg[i]] += ..., but its performance is questionable.
From your description it sounds like you have a many-to-few mapping. That is a big problem for parallelism because you likely have write conflicts in the target array. Attempts to control with critical sections or locks will probably only slow down the code.
Unless it is prohibitive in memory, I would give each thread a private copy of pout and sum into that, then add those copies together. Now the reading of the source array can be nicely divided up between the threads. If the pout array is not too large, your speedup should be decent.
Here is the crucial bit of code:
// 'threadsum' is an array of nthreads*ngroups elements, initialized to zero;
// thread 't' accumulates only into its own slice, so there are no write conflicts.
#pragma omp parallel shared(sum,threadsum)
{
  int thread = omp_get_thread_num(),
      myfirst = thread*ngroups;
  #pragma omp for
  for ( int i=0; i<biglen; i++ )
    threadsum[ myfirst+indexes[i] ] += 1;
  // combine the per-thread slices, parallelized over the groups
  #pragma omp for
  for ( int igrp=0; igrp<ngroups; igrp++ )
    for ( int t=0; t<nthreads; t++ )
      sum[igrp] += threadsum[ t*ngroups+igrp ];
}
Now for the tricky bit. I'm using an index array of size 100M, but the number of groups is crucial. With 5000 groups I get good speedup, but with only 50, even after eliminating things like false sharing, I get little or no speedup. Why is not yet clear to me.
Final word: I also coded @Laci's solution of just using a reduction. Testing with 1M groups in the output: for 2-8 threads the reduction solution is actually faster, but for higher thread counts I win by almost a factor of 2, because the reduction solution repeatedly adds whole copies of the array while I sum each element just once, and in parallel. For smaller numbers of groups the reduction is probably preferable overall.
I would like to apply a pretty simple straightforward calculation on a n-by-d-dimensional array. The goal is to convert the sequential calculation to a parallel one using pthreads. My question is: what is the optimal way to split the problem? How could I significantly reduce the execution time of my script? I provide a sample sequential code in C and some thoughts on parallel implementations that I have already tried.
#include <stdlib.h>  // for calloc()
#include <math.h>    // for pow() and sqrt()

double * calcDistance(double * X, int n, int d)
{
    //calculate and return an array[n-1] of all the distances
    //from the last point
    double *distances = calloc(n, sizeof(double));
    for(int i=0; i<n-1; i++)
    {
        for (int j=0; j<d; j++)
        {
            distances[i] += pow(X[(j+1)*n-1] - X[j*n+i], 2);
        }
        distances[i] = sqrt(distances[i]);
    }
    return distances;
}
I provide a main()-caller function in order for the sample to be complete and testable:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10 //00000
#define D 2

int main()
{
    srand(time(NULL));
    //allocate the proper space for X
    double *X = malloc(D*N*sizeof(double));
    //fill X with numbers in space (0,1)
    for(int i=0; i<N; i++)
    {
        for(int j=0; j<D; j++)
        {
            X[i+j*N] = (double) (rand() / (RAND_MAX + 2.0));
        }
    }
    double *distances = calcDistance(X, N, D);
    free(distances);
    free(X);
    return 0;
}
I have already tried utilizing pthreads asynchronously through the use of a global_index protected by a mutex, and a local_index. Through the use of a while() loop, a local_index is assigned to each thread on each iteration. The local_index assignment depends on the global_index value at that time (both happening in a mutual-exclusion block). The thread then executes the computation on the distances[local_index] element.
Unfortunately this implementation has led to a much slower program, with a 10x or 20x longer execution time compared to the sequential one cited above.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
Your inner loop jumps all over array X with a mixture of strides that varies with
the outer-loop iteration. Unless n and d are quite small,* this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.
what is the optimal way to split the problem?
Probably the best available way would be to split outer-loop iterations among your threads. For T threads, have one perform iterations 0 ... (N / T) - 1, have the second do (N / T) ... (2 * N / T) - 1, etc..
How could I significantly reduce the execution time of my script?
The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.
I have already tried utilizing pthreads asynchronously through the use
of a global_index protected by a mutex, and a local_index. [...]
If you have to involve a mutex, semaphore, or similar synchronization object then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is way over-engineered for this problem. Statically assigning iterations to threads as I already described will remove the need for such synchronization, and since the cost of the inner loop does not look like it will vary much for different outer-loop iterations, there probably will not be too much inefficiency introduced that way.
Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.
This sounds like what I described. It is one of the standard scheduling models provided by OpenMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you parallelize across five threads on a four-core machine, then one thread has to wait to run until one of the others has finished -- best theoretical speedup 60%. Parallelizing the same computation across only four threads uses the compute resources more efficiently, for a best theoretical speedup of about 75%.
* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.
Assume we have two 3D arrays, A and B, with a different number of elements each. I do some operations on values of A that correspond to values of B at different indices.
For example: I use A[i][j][k] to calculate some quantity. Since each calculation is independent, I can do this using parallel for with no problem. But the updated values are used to increase the values at some positions of the B array.
For example:
A[i][j][k] -> c (a number) -> B[l][m][n]. But at the same time a lot of writes could occur to B[l][m][n]. I use B[l][m][n] += c to update B's elements. Of course I cannot use OpenMP naively here because I would violate the independence of loop iterations. Nor do I know the indices l,m,n a priori in order to group the writes into buffers. Any good ideas on how to implement this in parallel? Maybe critical or atomic would help me here, but I don't know how.
A simplified version of my problem (1D)
for(int i=0; i<size_A; i++)
{
    //Some code here to update A[i]. Each update is independent.
}
for(int j=0; j<size_A; j++)
{
    //From A[j], B[m] and B[m+1] are evaluated
    int m = A[j]/dx;
    double c; //Make some calculations
    B[m] += c;
    B[m+1] += c*c;
}
There are two ways to do this. Either you make each update atomic, which just looks like:
#pragma omp atomic update
B[m] += c;
Or each thread gets a private update-version of B, and after the loop the private copies are all safely added together. This is possible by using:
#pragma omp parallel for reduction(+:B)
for (...)
The latter requires OpenMP 4.5, but there are workarounds for earlier versions.
Which one is better for you depends on the rate of updates and size of the matrix. You have to measure both to be sure.
I would like to calculate a two dimensional float array 'Image2D' and do this faster by using 'OpenMP' to execute the outer for-loop in parallel.
In the loops, the position '[jy][jx]' inside 'Image2D' gets calculated. So it is possible that, at the same moment, two (or more) threads want to increment 'Image2D' at the same position '[jy][jx]'. From what I understand (but you may correct me), in that case only one increment is performed while the others are lost.
To avoid this, I thought to add the line of code '#pragma omp critical'. It makes sure only one thread can read/increment/write the variable 'Image2D'.
Unfortunately, this means that when a first thread is accessing 'Image2D', the other threads must wait until the first finished its job. For my code, this will slow down the execution tremendously because 'Image2D' is accessed all the time.
Moreover, '#pragma omp critical' is too strict: it prevents multiple threads from accessing the whole array 'Image2D', while it would be sufficient to prevent concurrent access to a single element 'Image2D[jy][jx]'.
So, my question is: Is there a way to
(i) prevent multiple threads from writing 'Image2D[jy][jx]' at the same time;
(ii) without letting the threads wait for each other unnecessarily, and hence obtain fast code?
Thank you for your answer
#pragma omp parallel private( ia, iR, Cte, jjx, jx, jy )
{ // start parallel
    #pragma omp for
    for ( ia = i0a; ia <= i1a; ia++ ) {
        // ... code removed ....
        for ( iR = i0R; iR <= i1R; iR++ ) {
            // ... code removed ....
            // 'Cte' (float) and 'jjx' (float) are computed
            for ( jy = j0y; jy <= j1y; jy++ ) {
                // ... code removed ...
                // 'jx' (int) gets computed
                #pragma omp critical
                Image2D[jy][jx] += Cte * ( 1.0 - ( jjx - jx ) ); // increment 'Image2D[jy][jx]'
                // ... code removed ....
            } // Next 'jy'
        } // Next 'iR'
    } // Next 'ia'
} // end parallel section
I don't know about OpenMP, but it's sounding like it's not quite giving you the level of control that you want.
As PetrH suggests, the best answer may be to have a private array per thread and then sum those arrays afterwards; even that summation can be parallelised.
That will work quite well provided you have the memory for it. If not, you might have to consider an alternative, which could involve mutex semaphores.
Obviously your 2D array is large enough that OpenMP is worthwhile, so a mutex per cell is likely going to be wasteful. You could have a mutex per row, or one per 10 rows, etc. The fewer mutexes, the greater the chance of two threads contending for the same group of cells. Also, whilst you would be taking and giving a mutex once per loop iteration (and calculating which mutex, too), OSes nowadays go to a lot of effort to make mutex semaphore system calls fast. It could be that a little extra CPU time per thread gives you the scheduling you want, and thus the overall benefit might be good.
I am writing a basic code to add two matrices and note the time taken for a single thread versus 2 or more threads. In this approach I first divide the given two matrices (initialized randomly) into THREADS segments, and then each segment is sent to the addition module, which is started by a pthread_create call. The argument to the parallel addition function is the following.
struct thread_segment
{
matrix_t *matrix1, *matrix2, *matrix3;
int start_row, offset;
};
Pointers to the two source matrices and one destination matrix (one source and the destination may point to the same matrix). start_row is the row from which the particular thread should start adding, and offset tells how many rows this thread should add, starting from start_row.
The matrix_t is a simple structure defined as below:
typedef struct _matrix_t
{
TYPE **mat;
int r, c;
} matrix_t;
I have compiled it with 2 threads, but there is (almost) no speedup when I run it with a 10000 x 10000 matrix. I am recording the running time with time -p program.
The matrix random initialization is also done in parallel like above.
I think this is because all the threads work on the same matrix address area; maybe the resulting memory bottleneck prevents any speedup, even though the threads work on different, non-overlapping segments of the matrix.
Previously I implemented a parallel mergesort and a quicksort which showed similar behavior: I was able to get speedup when I copied the data segment on which a particular thread works into newly allocated memory.
My question is that is this because of:
memory bottleneck?
Time benchmark is not done in the proper way?
Dataset too small?
Coding error?
Other
If it is a memory bottleneck, does every parallel program then have to use exclusive memory areas, even when the threads can access shared memory without a mutex?
EDIT
I can see a speedup when I make the matrix segments like this:
curr = 0;
jump = matrix1->r / THREADS;
for (i=0; i<THREADS; i++)
{
th_seg[i].matrix1 = malloc (sizeof (matrix_t));
th_seg[i].matrix1->mat = &(matrix1->mat[curr]);
th_seg[i].matrix1->c = matrix1->c;
th_seg[i].matrix1->r = jump;
curr += jump;
}
That is, before passing, I assign the base address of the sub-matrix to be processed by this thread in the structure, and store the number of rows. So now the base address of each matrix is different for each thread. But this only helps if I add some small-dimension matrix, say 100 x 100, many times. Before calling the parallel add in each iteration, I reassign the random values. Is the speedup noticed here real, or due to some other phenomenon such as caching effects?
To optimize memory usage, you may want to take a look at loop tiling, which helps keep the cache warm. In this approach you divide your matrices into smaller chunks, so the cache can hold the values for a longer time and does not need to be refreshed frequently.
Also note that creating too many threads just increases the overhead of switching among them.
To get a feeling for how much a proper implementation can affect the run time of a concurrent program, these are the results of programs multiplying two matrices in naive, concurrent, and tiling-concurrent versions:
seconds name
10.72 simpleMul
5.16 mulThread
3.19 tilingMulThread