OpenMP - Why does the number of comparisons decrease? - c

I have the following algorithm:
int hostMatch(long *comparisons)
{
int i = -1;
int lastI = textLength-patternLength;
*comparisons=0;
#pragma omp parallel for schedule(static, 1) num_threads(1)
for (int k = 0; k <= lastI; k++)
{
int j;
for (j = 0; j < patternLength; j++)
{
(*comparisons)++;
if (textData[k+j] != patternData[j])
{
j = patternLength+1; //break
}
}
if (j == patternLength && k > i)
i = k;
}
return i;
}
When changing num_threads I get the following results for number of comparisons:
01 = 9949051000
02 = 4992868032
04 = 2504446034
08 = 1268943748
16 = 776868269
32 = 449834474
64 = 258963324
Why is the number of comparisons not constant? It's interesting because the number of comparisons halves with the doubling of the number of threads. Is there some sort of race conditions going on for (*comparisons)++ where OMP just skips the increment if the variable is in use?
My current understanding is that the iterations of the k loop are split near-evenly amongst the threads. Each iteration has a private integer j as well as a private copy of integer k, and a non-parallel for loop which adds to the comparisons until terminated.

The naive way around the race condition to declare the operation as atomic update:
#pragma omp atomic update
(*comparisons)++;
Note that a critical section here is unnecessary and much more expensive. An atomic update can be declared on a primitive binary or unary operation on any l-value expression with scalar type.
Yet this is still not optimal because the value of *comparisons needs to be moved around between CPU caches all the time and a expensive locked instruction is performed. Instead you should use a reduction. For that you need another local variable, the pointer won't work here.
int hostMatch(long *comparisons)
{
int i = -1;
int lastI = textLength-patternLength;
long comparisons_tmp = 0;
#pragma omp parallel for reduction(comparisons_tmp:+)
for (int k = 0; k <= lastI; k++)
{
int j;
for (j = 0; j < patternLength; j++)
{
comparisons_tmp++;
if (textData[k+j] != patternData[j])
{
j = patternLength+1; //break
}
}
if (j == patternLength && k > i)
i = k;
}
*comparisons = comparisons_tmp;
return i;
}
P.S. schedule(static, 1) seems like a bad idea, since this will lead to inefficient memory access patterns on textData. Just leave it out and let the compiler do it's thing. If a measurement shows that it's not working efficiently, give it some better hints.

You said it yourself (*comparisons)++; has a race condition. It is a critical section that has to be serialized (I don't think (*pointer)++ is an atomic operation).
So basically you read the same value( i.e. 2) twice by two threads and then both increase it (3) and write it back. So you get 3 instead of 4. You have to make sure the operations on variables, that are not in the local scope of your parallelized function/loop, don't overlap.

Related

Optimizing a matrix transpose function with OpenMP

I have this code that transposes a matrix using loop tiling strategy.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
}
I want to optimize this with multi-threading using OpenMP, however I am not sure what to do when having so many nested for loops. I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free. As in: every iteration writing to a different location. If two iterations write (potentially) to the same location, you need to 1. use a reduction 2. use a critical section or other synchronization 3. decide that this loop is not worth parallelizing, or 4. rewrite your algorithm.
In your case, the write location depends on k,l. Since k<n and l*n, there are no pairs k.l / k',l' that write to the same location. Furthermore, there are no two inner iterations that have the same k or l value. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).
You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
You can use the collapse specifier to parallelize over two loops.
# pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
As a side-note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing is more important for performance.
I thought about just adding #pragma omp parallel for but doesnt this
just parallelize the outer loop?
Yes. To parallelize multiple consecutive loops one can utilize OpenMP' collapse clause. Bear in mind, however that:
(As pointed out by Victor Eijkhout). Even though this does not directly apply to your code snippet, typically, for each new loop to be parallelized one should reason about potential newer race-conditions e.g., that this parallelization might have added. For example, different threads writing concurrently into the same dst position.
in some cases parallelizing nested loops may result in slower execution times than parallelizing a single loop. Since, the concrete implementation of the collapse clause uses a more complex heuristic (than the simple loop parallelization) to divide the iterations of the loops among threads, which can result in an overhead higher than the gains that it provides.
You should try to benchmark with a single parallel loop and then with two, and so on, and compare the results, accordingly.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
#pragma omp parallel for collapse(...)
for (int i = 0; i < n; i += blocksize)
for (int j = 0; j < m; j += blocksize)
for (int k = i; k < i + blocksize; ++k
for (int l = j; l < j + blocksize; ++l)
dst[k + l*n] = src[l + k*m];
}
Depending upon the number of threads, cores, size of the matrices among other factors it might be that running sequential would actually be faster than the parallel versions. This is specially true in your code that is not very CPU intensive (i.e., dst[k + l*n] = src[l + k*m];)

Search of max value and index in a vector

I'm trying to parallelize this piece of code that search for a max on a column.
The problem is that the parallelize version runs slower than the serial
Probably the search of the pivot (max on a column) is slower due the syncrhonization on the maximum value and the index, right?
int i,j,t,k;
// Decrease the dimension of a factor 1 and iterate each time
for (i=0, j=0; i < rwA, j < cwA; i++, j++) {
int i_max = i; // max index set as i
double matrixA_maxCw_value = fabs(matrixA[i_max][j]);
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
if (matrixA[i_max][j] == 0) {
j++; //Check if there is a pivot in the column, if not pass to the next column
}
else {
//Swap the rows, of A, L and P
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
lupFactorization(matrixA,L,i,j,rwA);
}
}
void swapRows(double **matrixA, int i, int j, int i_max) {
double temp_val = matrixA[i][j];
matrixA[i][j] = matrixA[i_max][j];
matrixA[i_max][j] = temp_val;
}
I do not want a different code but I want only know why this happens, on a matrix of dimension 1000x1000 the serial version takes 4.1s and the parallelized version 4.28s
The same thing (the overhead is very small but there is) happens on the swap of the rows that theoretically can be done in parallel without problem, why it happens?
There is at least two things wrong with your parallelization
#pragma omp parallel for reduction(max:matrixA_maxCw_value,i_max) //OVERHEAD
for (t = i+1; t < rwA; t++) {
if (fabs(matrixA[t][j]) > matrixA_maxCw_value) {
matrixA_maxCw_value = matrixA[t][j];
i_max = t;
}
}
You are getting the biggest index of all of them, but that does not mean that it belongs to the max value. For instance looking at the following array:
[8, 7, 6, 5, 4 ,3, 2 , 1]
if you parallelized with two threads, the first thread will have max=8 and index=0, the second thread will have max=4 and index=4. After the reduction is done the max will be 8 but the index will be 4 which is obviously wrong.
OpenMP has in-build reduction functions that consider a single target value, however in your case you want to reduce taking into account 2 values the max and the array index. After OpenMP 4.0 one can create its own reduction functions (i.e., User-Defined Reduction).
You can have a look at a full example implementing such logic here
The other issue is this part:
#pragma omp parallel for //OVERHEAD
for (k = 0; k < cwA; k++) {
swapRows(matrixA, i, k, i_max);
swapRows(P, i, k, i_max);
if(k < i) {
swapRows(L, i, k, i_max);
}
}
You are swapping those elements in parallel, which leads to inconsistent state.
First you need to solve those issue before analyzing why your code is not having speedups.
First correctness then efficiency. But don't except much speedups with the current implementation, the amount of computation performed in parallelism is that much to justify the overhead of the parallelism.

How does openMP COLLAPSE works internally?

I am trying the openMP parallelism, to multiply 2 matrixes using 2 threads.
I understand how the outer loop parallelism works (i.e without the "collapse(2)" works).
Now, using collapse:
#pragma omp parallel for collapse(2) num_threads(2)
for( i = 0; i < m; i++)
for( j = 0; j < n; j++)
{
s = 0;
for( k = 0; k < p; k++)
s += A[i][k] * B[k][j];
C[i][j] = s;
}
From what I gather, collapse "collapses" the loops into a single big loop, and then uses threads in the big loop. So, for the previous code, i think that it would be equivalent to something like this:
#pragma omp parallel for num_threads(2)
for (ij = 0; ij <n*m; ij++)
{
i= ij/n;
j= mod(ij,n);
s = 0;
for( k = 0; k < p; k++)
s += A[i][k] * B[k][j];
C[i][j] = s;
}
My questions are:
Is that how it works? I have not found any explaination on how it
"collapses" the loops.
If yes, what is the benefit in using that? Doesn't
it divide the jobs between 2 threads EXACTLY like the parallelism without
collapsing?. If not, then does how it work?
PS: Now that i am thinking a little bit more, in case that n is an odd number, say 3, without the collapse one thread would have 2 iterations, and the other just one. That results in uneven jobs for the threads, and a bit less efficient.
If we were to use my collapse equivalent (if that is how collapse indeed works) each thread would have "1.5" iterations. If n would be very large, that would not really matter, would it? Not to mention, doing that i= ij/n; j= mod(ij,n); every time, it decreases performance, doesn't it?
The OpenMP specification says just (page 58 of Version 4.5):
If a collapse clause is specified with a parameter value greater than 1, then the iterations of the associated loops to which the clause applies are collapsed into one larger iteration space that is then divided according to the schedule clause. The sequential execution of the iterations in these associated loops determines the order of the iterations in the collapsed iteration space.
So, basically your logic is correct, except that your code is equivalent to the schedule(static,1) collapse(2) case, i.e. iteration chunk size of 1. In the general case, most OpenMP runtimes have default schedule of schedule(static), which means that the chunk size will be (approximately) equal to the number of iterations divided by the number of threads. The compiler may then use some optimisation to implement it by e.g. running a partial inner loop for a fixed value for the outer loop, then an integer number of outer iterations with complete inner loops, then a partial inner loop again.
For example, the following code:
#pragma omp parallel for collapse(2)
for (int i = 0; i < 100; i++)
for (int j = 0; j < 100; j++)
a[100*i+j] = i+j;
gets transformed by the OpenMP engine of GCC into:
<bb 3>:
i = 0;
j = 0;
D.1626 = __builtin_GOMP_loop_static_start (0, 10000, 1, 0, &.istart0.3, &.iend0.4);
if (D.1626 != 0)
goto <bb 8>;
else
goto <bb 5>;
<bb 8>:
.iter.1 = .istart0.3;
.iend0.5 = .iend0.4;
.tem.6 = .iter.1;
D.1630 = .tem.6 % 100;
j = (int) D.1630;
.tem.6 = .tem.6 / 100;
D.1631 = .tem.6 % 100;
i = (int) D.1631;
<bb 4>:
D.1632 = i * 100;
D.1633 = D.1632 + j;
D.1634 = (long unsigned int) D.1633;
D.1635 = D.1634 * 4;
D.1636 = .omp_data_i->a;
D.1637 = D.1636 + D.1635;
D.1638 = i + j;
*D.1637 = D.1638;
.iter.1 = .iter.1 + 1;
if (.iter.1 < .iend0.5)
goto <bb 10>;
else
goto <bb 9>;
<bb 9>:
D.1639 = __builtin_GOMP_loop_static_next (&.istart0.3, &.iend0.4);
if (D.1639 != 0)
goto <bb 8>;
else
goto <bb 5>;
<bb 10>:
j = j + 1;
if (j <= 99)
goto <bb 4>;
else
goto <bb 11>;
<bb 11>:
j = 0;
i = i + 1;
goto <bb 4>;
<bb 5>:
__builtin_GOMP_loop_end_nowait ();
<bb 6>:
This is a C-like representation of the program's abstract syntax tree, which probably a bit hard to read, but what it does is, it uses modulo arithmetic only once to compute the initial values of i and j based on the start of the iteration block (.istart0.3) determined by the call to GOMP_loop_static_start(). Then it simply increases i and j as one would expect a loop nest to be implemented, i.e. increase j until it hits 100, then reset j to 0 and increase i. At the same time, it also keeps the current iteration number from the collapsed iteration space in .iter.1, basically iterating at the same time both the single collapsed loop and the two nested loops.
As to case when the number of threads does not divide the number of iterations, the OpenMP standard says:
When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. The size of the chunks is unspecified in this case.
The GCC implementation leaves the threads with highest IDs doing one iteration less. Other possible distribution strategies are outlined in the note on page 61. The list is by no means exhaustive.
The exact behavior is not specified by the standard itself. However, the standard requires that the inner loop has exactly the same iterations for each iteration of the outer loop. This allows the following transformation:
#pragma omp parallel
{
int iter_total = m * n;
int iter_per_thread = 1 + (iter_total - 1) / omp_num_threads(); // ceil
int iter_start = iter_per_thread * omp_get_thread_num();
int iter_end = min(iter_iter_start + iter_per_thread, iter_total);
int ij = iter_start;
for (int i = iter_start / n;; i++) {
for (int j = iter_start % n; j < n; j++) {
// normal loop body
ij++;
if (ij == iter_end) {
goto end;
}
}
}
end:
}
From skimming the disassembly, i believe this is similar to what GCC does. It does avoid the per-iteration division/modulo, but costs one register and addition per inner iterator. Of course it will vary for different scheduling strategies.
Collapsing loops does increase the number of loop iterations that can be assigned to threads, thus helping with load balance or even exposing enough parallel work in the first place.

Longest Common Subsequence with openMP

I'm writing a parallel version of the Longest Common Subsequence algorithm using openMP.
The sequential version is the following (and it works correctly):
// Preparing first row and first column with zeros
for(j=0; j < (len2+1); j++)
score[0][j] = 0;
for(i=0; i < (len1+1); i++)
score[i][0] = 0;
// Calculating scores
for(i=1; i < (len1+1); i++) {
for(j=1; j < (len2+1) ;j++) {
if (seq1[i-1] == seq2[j-1]) {
score[i][j] = score[i-1][j-1] + 1;
}
else {
score[i][j] = max(score[i-1][j], score[i][j-1]);
}
}
}
The critical part is filling up the score matrix and this is the part I'm trying to mostly parallelize.
One way to do it (which I chose) is: filling up the matrix by anti diagonals, so left, top and top-left dependecies are always satisfied. In a nutshell, I keep track of the diagonal (third loop, variable i below) and threads fill up that diagonal in parallel.
For this purpose, I've written this code:
void parallelCalculateLCS(int len1, int len2, char *seq1, char *seq2) {
int score[len1 + 1][len2 + 1];
int i, j, k, iam;
char *lcs = NULL;
for(i=0;i<len1+1;i++)
for(j=0;j<len2+1;j++)
score[i][j] = -1;
#pragma omp parallel default(shared) private(iam)
{
iam = omp_get_thread_num();
// Preparing first row and first column with zeros
#pragma omp for
for(j=0; j < (len2+1); j++)
score[0][j] = iam;
#pragma omp for
for(i=0; i < (len1+1); i++)
score[i][0] = iam;
// Calculating scores
for(i=1; i < (len1+1); i++) {
k=i;
#pragma omp for
for(j=1; j <= i; j++) {
if (seq1[k-1] == seq2[j-1]) {
// score[k][j] = score[k-1][j-1] + 1;
score[k][j] = iam;
}
else {
// score[k][j] = max(score[k-1][j], score[k][j-1]);
score[k][j] = iam;
}
#pragma omp atomic
k--;
}
}
}
}
The first two loops (first row and column) work correctly and threads fill up cells in a balanced way.
When it comes to fill up the matrix (diagonally), nothing works well. I tried to debug it, but it seems that threads act and write things randomly.
I can't figure out what's going wrong, since in the first two loops there were no problems at all.
Any idea?
P.S. I know that accessing matrix in a diagonal way is very cache-unfriendly and threads could be unbalanced, but I only need it to work by now.
P.S. #2 I don't know if it could be useful, but my CPU has up to 8 threads.
#pragma omp atomic means that the processors will perform the operation one at a time. You are looking for #pragma omp for private(k) : the processors will no longer share the same value. Bye, Francis
The following nested for loop
#pragma omp for
for(j=1; j <= i; j++)
will be executed in parallel, each thread with a different value of j in no specific order.
As nothing is specified in the omp for section, k will be shared by default between all threads. So depending on the order of the threads, k will be decremented at an unknown time (even with the omp atomic). So for a fixed j, the value of k might change during the execution of the body of the for loop (between the if clauses, ...).

Parallelizing giving wrong output

I got some problems trying to parallelize an algorithm. The intention is to do some modifications to a 100x100 matrix. When I run the algorithm without openMP everything runs smoothly in about 34-35 seconds, when I parallelize on 2 threads (I need it to be with 2 threads only) it gets down to like 22 seconds but the output is wrong and I think it's a synchronization problem that I cannot fix.
Here's the code :
for (p = 0; p < sapt; p++){
memset(count,0,Nc*sizeof(int));
for (i = 0; i < N; i ++){
for (j = 0; j < N; j++){
for( m = 0; m < Nc; m++)
dist[m] = N+1;
omp_set_num_threads(2);
#pragma omp parallel for shared(configurationMatrix, dist) private(k,m) schedule(static,chunk)
for (k = 0; k < N; k++){
for (m = 0; m < N; m++){
if (i == k && j == m)
continue;
if (MAX(abs(i-k),abs(j-m)) < dist[configurationMatrix[k][m]])
dist[configurationMatrix[k][m]] = MAX(abs(i-k),abs(j-m));
}
}
int max = -1;
for(m = 0; m < Nc; m++){
if (dist[m] == N+1)
continue;
if (dist[m] > max){
max = dist[m];
configurationMatrix2[i][j] = m;
}
}
}
}
memcpy(configurationMatrix, configurationMatrix2, N*N*sizeof(int));
#pragma omp parallel for shared(count, configurationMatrix) private(i,j)
for (i = 0; i < N; i ++)
for (j = 0; j < N; j++)
count[configurationMatrix[i][j]] ++;
for (i = 0; i < Nc; i ++)
fprintf(out,"%i ", count[i]);
fprintf(out, "\n");
}
In which : sapt = 100;
count -> it's a vector that holds me how many of an each element of the matrix I'm having on each step;
(EX: count[1] = 60 --> I have the element '1' 60 times in my matrix and so on)
dist --> vector that holds me max distances from element i,j of let's say value K to element k,m of same value K.
(EX: dist[1] = 10 --> distance from the element of value 1 to the furthest element of value 1)
Then I write stuff down in an output file, but again, wrong output.
If I understand your code correctly this line
count[configurationMatrix[i][j]] ++;
increments count at the element whose index is at configurationMatrix[i][j]. I don't see that your code takes any steps to ensure that threads are not simultaneously trying to increment the same element of count. It's entirely feasible that two different elements of configurationMatrix provide the same index into count and that those two elements are handled by different threads. Since ++ is not an atomic operation your code has a data race; multiple threads can contend for update access to the same variable and you lose any guarantees of correctness, or determinism, in the result.
I think you may have other examples of the same problem in other parts of your code too. You are silent on the errors you observe in the results of the parallel program compared with the results from the serial program yet those errors are often very useful in diagnosing a problem. For example, if the results of the parallel program are not the same every time you run it, that is very suggestive of a data race somewhere in your code.
How to fix this ? Since you only have 2 threads the easiest fix would be to not parallelise this part of the program. You could wrap the data race inside an OpenMP critical section but that's really just another way of serialising your code. Finally, you could possibly modify your algorithm and data structures to avoid this problem entirely.

Resources