Why for loop order effect running time in matrix multiplication? - c

I'am writing a C program to calculate the product of two matrix.
The problem That I noticed that the order of for loops does matter. For example:
for N=500
for (int i = 0; i < N; ++i) {
for (int j = 0; j < N; ++j) {
for (int k = 0 ; k < N; ++k) {
C[i*N+j]+=A[i*N+k] * B[k*N+j];
}
}
}
execution time (Seconds) : 1.1531820000
for (int j = 0; j < N; ++j) {
for (int k = 0 ; k < N; ++k) {
for (int i = 0; i < N; ++i) {
C[i*N+j]+=A[i*N+k] * B[k*N+j];
}
}
}
execution time (Seconds) : 2.6801300000
Matrix declaration:
A=(double*)malloc(sizeof(double)*N*N);
B=(double*)malloc(sizeof(double)*N*N);
C=(double*)malloc(sizeof(double)*N*N);
I run the test for 5 time than calculate the average. Anyone have an idea why is this happening?

With the second loop, you keep making many big jumps all the time when you increment i in the inner loop, and to a lesser extent k. The cache is probably not very happy with that.
The first loop is better, indeed it's even better if you invert the orders of j and k.
This is essentially a problem of data locality. Accesses to main memory are very slow on modern architectures, so your CPU will keep caches of recently accessed memory and try to prefetch memory that is likely to be accessed next. Those caches are very efficient at speeding up accesses that are grouped in the same small area, or accesses that follow a predictable pattern.
Here we turned a pattern where the CPU would make big jumps through memory and then come back into a nice mostly sequential pattern, hence the speedup.

Related

Optimizing a matrix transpose function with OpenMP

I have this code that transposes a matrix using loop tiling strategy.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
}
I want to optimize this with multi-threading using OpenMP, however I am not sure what to do when having so many nested for loops. I thought about just adding #pragma omp parallel for but doesn't this just parallelize the outer loop?
When you try to parallelize a loop nest, you should ask yourself how many levels are conflict free. As in: every iteration writing to a different location. If two iterations write (potentially) to the same location, you need to 1. use a reduction 2. use a critical section or other synchronization 3. decide that this loop is not worth parallelizing, or 4. rewrite your algorithm.
In your case, the write location depends on k,l. Since k<n and l*n, there are no pairs k.l / k',l' that write to the same location. Furthermore, there are no two inner iterations that have the same k or l value. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4).
You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
You can use the collapse specifier to parallelize over two loops.
# pragma omp parallel for collapse(2)
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < m; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*m];
}
}
}
}
As a side-note, I think you should swap the two innermost loops. Usually, when you have a choice between writing sequentially and reading sequentially, writing is more important for performance.
I thought about just adding #pragma omp parallel for but doesnt this
just parallelize the outer loop?
Yes. To parallelize multiple consecutive loops one can utilize OpenMP' collapse clause. Bear in mind, however that:
(As pointed out by Victor Eijkhout). Even though this does not directly apply to your code snippet, typically, for each new loop to be parallelized one should reason about potential newer race-conditions e.g., that this parallelization might have added. For example, different threads writing concurrently into the same dst position.
in some cases parallelizing nested loops may result in slower execution times than parallelizing a single loop. Since, the concrete implementation of the collapse clause uses a more complex heuristic (than the simple loop parallelization) to divide the iterations of the loops among threads, which can result in an overhead higher than the gains that it provides.
You should try to benchmark with a single parallel loop and then with two, and so on, and compare the results, accordingly.
void transposer(int n, int m, double *dst, const double *src) {
int blocksize;
#pragma omp parallel for collapse(...)
for (int i = 0; i < n; i += blocksize)
for (int j = 0; j < m; j += blocksize)
for (int k = i; k < i + blocksize; ++k
for (int l = j; l < j + blocksize; ++l)
dst[k + l*n] = src[l + k*m];
}
Depending upon the number of threads, cores, size of the matrices among other factors it might be that running sequential would actually be faster than the parallel versions. This is specially true in your code that is not very CPU intensive (i.e., dst[k + l*n] = src[l + k*m];)

Find a tight upper bound on complexity of the below program: [duplicate]

This question already has answers here:
Why is the runtime of this code O(n^5)?
(3 answers)
Closed 3 years ago.
Find a tight upper bound on the complexity of this program.
I tried. I think the time complexity of this code O(n2).
void function(int n)
{
int count = 0;
for (int i=0; i<n; i++)
for (int j=i; j< i*i; j++)
if (j%i == 0)
{
for (int k=0; k<j; k++)
printf("*");
}
}
But the answer given is O(n5). How?
EDIT: Yikes, after spending the time to answer this question, I discovered that this is a duplicate of a previous question that I had also answered three years ago. Oops!
The tightest bound you can get on this function's runtime is Θ(n4). Here's the derivation.
This is a great place to illustrate a great general strategy for determining the big-O of a piece of code:
"When in doubt, work inside out!"
Let's take your code:
for (int i=0; i<n; i++)
{
for (int j=i; j< i*i; j++)
{
if (j%i == 0)
{
for (int k=0; k<j; k++)
{
printf("*");
}
}
}
}
Our approach for analyzing the runtime complexity will be to repeatedly take the innermost loop and replace it with the amount of work that it does. When we're done, we'll have our final time complexity.
Let's begin with this innermost loop:
for (int k=0; k<j; k++)
{
printf("*");
}
The amount of work done here is Θ(j), since the number of loop iterations is directly proportional to j and we do a constant amount of work per loop iteration. So let's replace this loop with the simpler "do Θ(j) work," giving us this simplified loop nest:
for (int i=0; i<n; i++)
{
for (int j=i; j< i*i; j++)
{
if (j%i == 0)
{
do Θ(j) work
}
}
}
Now, let's take aim at what's now the innermost loop:
for (int j=i; j < i*i; j++)
{
if (j%i == 0)
{
do Θ(j) work
}
}
This loop is unusual in that the amount of work that it does varies pretty dramatically from one iteration to the next. Specifically:
most iterations will do only O(1) work, but
one out of every i iterations will do Θ(j) work.
To analyze this loop, we'll therefore split the work apart into these two constituent pieces and see how much each contributes to the total.
First, let's look at the "easy" iterations of the loop, which do only O(1) work. There are a total of Θ(i2) iterations of the loop (the loop starts counting at j = i and stops when j = i2 and i2 - i = Θ(i2). We can therefore bound the contribution of the of these "easy" loop iterations at O(i2) work.
Now, what about the "hard" loop iterations? These iterations occur when j = i, when j = 2i, when j = 3i, j = 4i, etc. Moreover, each of these "hard" iterations do work directly proportional to j during the iteration. This means that, if we add up the work across all of these iterations, the total work done is given by
i + 2i + 3i + 4i + ... + (i - 1)i.
We can simplify this as follows:
i + 2i + 3i + 4i + ... + (i - 1)i
= i(1 + 2 + 3 + ... + i-1)
= i · Θ(i2)
= Θ(i3).
This uses the fact that 1 + 2 + 3 + ... + k = k(k + 1) / 2 = Θ(k2), which is Gauss's famous sum.
So now we see that the work done by the inner loop here is given by
O(i2) work for the easy iterations, and
Θ(i3) work for the hard iterations.
Summing this up, we see that the total work done by this inner loop is Θ(i3). Continuing our process of working inside out, we can replace this inner loop with "do Θ(i3) work" to get the following:
for (int i=0; i<n; i++)
{
do Θ(i^3) work
}
From here, we see that the work done is
13 + 23 + 33 + ... + (n - 1)3,
and that sum is Θ(n4). (Specifically, it's n2(n - 1)2 / 4.)
So overall, the theory predicts that the runtime should be Θ(n4), which is a factor of n lower than the O(n5) bound you mentioned above. How does the theory match the practice?
I ran this code on a variety of values of n and counted how many times that a star was printed. Here's the values I got back:
n = 500: 7760510375
n = 1000: 124583708250
n = 1500: 631407093625
n = 2000: 1996668166500
n = 2500: 4876304426875
n = 3000: 10113753374750
n = 3500: 18739952510125
n = 4000: 31973339333000
n = 4500: 51219851343375
If the runtime is Θ(n4), then if we double the size of the input, we should scale the output by a factor of 16. If the runtime is Θ(n5), then doubling the input size should scale the output by a factor of 32. Here's what I found:
Ratio of n = 1000 to n = 500: 16.0535
Ratio of n = 2000 to n = 1000: 16.0267
Ratio of n = 3000 to n = 1500: 16.0178
Ratio of n = 4000 to n = 2000: 16.0133
This strongly suggests that the runtime of this function is indeed Θ(n4) rather than Θ(n5).
Hope this helps!
I agree with the other poster, except the innermost loop is O(n^2), as k spans from 0 to j, which itself spans up to n^2. This gives us the desired answer of O(n^5).
The first loop
for (int i=0; i<n; i++)
Gives your first O(n) multiplier.
Next, the second loop
for (int j=i; j< i*i; j++)
Itself multiplies by O(n^2) complexity because i is "like" n here.
The third loop
for (int k=0; k<j; k++)
Multiplies by another O(n^2) because j is "like" n^2 here.
So you get complexity O(n^5).

Dependency in Adjacent Elements while Updating 2d array (OpenMP)

I have a 2d array, say arr[SIZE][SIZE], which is updated in two for loops in the form:
for(int i = 0; i < SIZE; i++)
for(int j = 0; j < SIZE; j++)
arr[i][j] = new_value();
that I am trying to parallelise using OpenMP.
There are two instances where this occurs, the first is the function new_value_1() which relies on arr[i+1][j] and arr[i][j+1] (the "edge of the array" issue is already taken care of), which I can happily parallelise using the chessboard technique:
for(int l = 0; l < 2; l++)
#pragma omp parallel for private(i, j)
for(int i = 0; i < SIZE; i++)
for(int j = (i + l) % 2; j < SIZE; j += 2)
arr[i][j] = new_value_1();
The issue comes with this second instance, new_value_2(), which relies upon:
arr[i+1][j],
arr[i-1][j],
arr[i][j+1],
arr[i][j-1]
i.e. the adjacent elements in all directions.
Here, there is a dependency on the negative adjacent elements, so arr[0][2] = new_value_2() depends on the already updated value of arr[0][1] which would not be computed until the second l loop.
I was wondering if there was something I was missing in parallelising this way or if the issue is inherent with the way the algorithm works? If the former, any other approaches would be appreciated.
I was wondering if there was something I was missing in parallelising this way or if the issue is inherent with the way the algorithm works?
Yes, you're missing something, but not along the lines you probably hoped. Supposing that the idea is that the parallel version should compute the same result as the serial version, the checkerboard approach does not solve the problem even for the new_value_1() case. Consider this layout of array elements:
xoxoxo
oxoxox
xoxoxo
oxoxox
On the first of the two checkerboard passes, the 'x' elements are updated according to the original values of the 'o' elements -- so far, so good -- but on the second pass, the 'o' elements are updated based on the new values of the 'x' elements. The data dependency is broken, yes, but the overall computation is not the same. Of course, the same applies even more so to the new_value_2() computation. The checkerboarding doesn't really help you.
If the former, any other approaches would be appreciated.
You could do the computation in shells. For example, consider this labeling of the array elements:
0123
1234
2345
3456
All the elements with the same label can be computed in parallel (for both new_value() functions), provided that all those with numerically lesser labels are computed first. That might look something like this:
for(int l = 0; l < (2 * SIZE - 1); l++) {
int iterations = (l < SIZE) ? (l + 1) : (2 * SIZE - (l + l));
int i = (l < SIZE) ? l : (SIZE - 1);
int j = (l < SIZE) ? 0 : (1 + l - SIZE);
#pragma omp parallel for private(i, j, m)
for(int m = 0; m < iterations; m++) {
arr[i--][j++] = new_value_1();
}
}
You won't that way get as much of a benefit from parallelization, but that is an inherent aspect of the way the serial computation works, at least for the new_value_2() case.
For the new_value_1() case, though, you might do a little better by going row by row:
for(int i = 0; i < SIZE; i++)
#pragma omp parallel for private(j)
for(int j = 0; j < SIZE; j++)
arr[i][j] = new_value_1();
Alternatively, for the new_value_1() case only, you could potentially get a good speedup by storing the results in a separate array:
#pragma omp parallel for private(i, j), collapse(2)
for(int i = 0; i < SIZE; i++)
for(int j = 0; j < SIZE; j++)
arr2[i][j] = new_value_1();
If that requires you to copy the result back to the original array afterward then it might well not be worth it, but you could potentially avoid that by flipping back and forth between two arrays: compute from the first into the second, and then the next time around, compute from the second into the first ( if the problem [PSPACE] scaling permits having such an extended in-RAM allocation, i.e. it again comes at a [PTIME] cost, hopefully paid just once ).

OpenMP - Why does the number of comparisons decrease?

I have the following algorithm:
int hostMatch(long *comparisons)
{
int i = -1;
int lastI = textLength-patternLength;
*comparisons=0;
#pragma omp parallel for schedule(static, 1) num_threads(1)
for (int k = 0; k <= lastI; k++)
{
int j;
for (j = 0; j < patternLength; j++)
{
(*comparisons)++;
if (textData[k+j] != patternData[j])
{
j = patternLength+1; //break
}
}
if (j == patternLength && k > i)
i = k;
}
return i;
}
When changing num_threads I get the following results for number of comparisons:
01 = 9949051000
02 = 4992868032
04 = 2504446034
08 = 1268943748
16 = 776868269
32 = 449834474
64 = 258963324
Why is the number of comparisons not constant? It's interesting because the number of comparisons halves with the doubling of the number of threads. Is there some sort of race conditions going on for (*comparisons)++ where OMP just skips the increment if the variable is in use?
My current understanding is that the iterations of the k loop are split near-evenly amongst the threads. Each iteration has a private integer j as well as a private copy of integer k, and a non-parallel for loop which adds to the comparisons until terminated.
The naive way around the race condition to declare the operation as atomic update:
#pragma omp atomic update
(*comparisons)++;
Note that a critical section here is unnecessary and much more expensive. An atomic update can be declared on a primitive binary or unary operation on any l-value expression with scalar type.
Yet this is still not optimal because the value of *comparisons needs to be moved around between CPU caches all the time and a expensive locked instruction is performed. Instead you should use a reduction. For that you need another local variable, the pointer won't work here.
int hostMatch(long *comparisons)
{
int i = -1;
int lastI = textLength-patternLength;
long comparisons_tmp = 0;
#pragma omp parallel for reduction(comparisons_tmp:+)
for (int k = 0; k <= lastI; k++)
{
int j;
for (j = 0; j < patternLength; j++)
{
comparisons_tmp++;
if (textData[k+j] != patternData[j])
{
j = patternLength+1; //break
}
}
if (j == patternLength && k > i)
i = k;
}
*comparisons = comparisons_tmp;
return i;
}
P.S. schedule(static, 1) seems like a bad idea, since this will lead to inefficient memory access patterns on textData. Just leave it out and let the compiler do it's thing. If a measurement shows that it's not working efficiently, give it some better hints.
You said it yourself (*comparisons)++; has a race condition. It is a critical section that has to be serialized (I don't think (*pointer)++ is an atomic operation).
So basically you read the same value( i.e. 2) twice by two threads and then both increase it (3) and write it back. So you get 3 instead of 4. You have to make sure the operations on variables, that are not in the local scope of your parallelized function/loop, don't overlap.

Parallelizing giving wrong output

I got some problems trying to parallelize an algorithm. The intention is to do some modifications to a 100x100 matrix. When I run the algorithm without openMP everything runs smoothly in about 34-35 seconds, when I parallelize on 2 threads (I need it to be with 2 threads only) it gets down to like 22 seconds but the output is wrong and I think it's a synchronization problem that I cannot fix.
Here's the code :
for (p = 0; p < sapt; p++){
memset(count,0,Nc*sizeof(int));
for (i = 0; i < N; i ++){
for (j = 0; j < N; j++){
for( m = 0; m < Nc; m++)
dist[m] = N+1;
omp_set_num_threads(2);
#pragma omp parallel for shared(configurationMatrix, dist) private(k,m) schedule(static,chunk)
for (k = 0; k < N; k++){
for (m = 0; m < N; m++){
if (i == k && j == m)
continue;
if (MAX(abs(i-k),abs(j-m)) < dist[configurationMatrix[k][m]])
dist[configurationMatrix[k][m]] = MAX(abs(i-k),abs(j-m));
}
}
int max = -1;
for(m = 0; m < Nc; m++){
if (dist[m] == N+1)
continue;
if (dist[m] > max){
max = dist[m];
configurationMatrix2[i][j] = m;
}
}
}
}
memcpy(configurationMatrix, configurationMatrix2, N*N*sizeof(int));
#pragma omp parallel for shared(count, configurationMatrix) private(i,j)
for (i = 0; i < N; i ++)
for (j = 0; j < N; j++)
count[configurationMatrix[i][j]] ++;
for (i = 0; i < Nc; i ++)
fprintf(out,"%i ", count[i]);
fprintf(out, "\n");
}
In which : sapt = 100;
count -> it's a vector that holds me how many of an each element of the matrix I'm having on each step;
(EX: count[1] = 60 --> I have the element '1' 60 times in my matrix and so on)
dist --> vector that holds me max distances from element i,j of let's say value K to element k,m of same value K.
(EX: dist[1] = 10 --> distance from the element of value 1 to the furthest element of value 1)
Then I write stuff down in an output file, but again, wrong output.
If I understand your code correctly this line
count[configurationMatrix[i][j]] ++;
increments count at the element whose index is at configurationMatrix[i][j]. I don't see that your code takes any steps to ensure that threads are not simultaneously trying to increment the same element of count. It's entirely feasible that two different elements of configurationMatrix provide the same index into count and that those two elements are handled by different threads. Since ++ is not an atomic operation your code has a data race; multiple threads can contend for update access to the same variable and you lose any guarantees of correctness, or determinism, in the result.
I think you may have other examples of the same problem in other parts of your code too. You are silent on the errors you observe in the results of the parallel program compared with the results from the serial program yet those errors are often very useful in diagnosing a problem. For example, if the results of the parallel program are not the same every time you run it, that is very suggestive of a data race somewhere in your code.
How to fix this ? Since you only have 2 threads the easiest fix would be to not parallelise this part of the program. You could wrap the data race inside an OpenMP critical section but that's really just another way of serialising your code. Finally, you could possibly modify your algorithm and data structures to avoid this problem entirely.

Resources