I need to parallelize a code from the SBAC-PAD marathon (reference: http://lspd.mackenzie.br/marathon/18/problems.html) twice, once with MPI and once with OpenMP (two separate problems). I am working on the Himeno benchmark. I believe the only part of this code worth parallelizing is the jacobi function:
#define MR(mt,n,r,c,d) mt->m[(n) * mt->mrows * mt->mcols * mt->mdeps + (r) * mt->mcols* mt->mdeps + (c) * mt->mdeps + (d)]
struct Matrix {
    float* m;
    int mnums;
    int mrows;
    int mcols;
    int mdeps;
};
float
jacobi(int nn, Matrix* a, Matrix* b, Matrix* c,
       Matrix* p, Matrix* bnd, Matrix* wrk1, Matrix* wrk2)
{
    int i, j, k, n, imax, jmax, kmax;
    float gosa, s0, ss;

    imax = p->mrows - 1;
    jmax = p->mcols - 1;
    kmax = p->mdeps - 1;

    for (n = 0; n < nn; n++) {
        gosa = 0.0;
        for (i = 1; i < imax; i++)
            for (j = 1; j < jmax; j++)
                for (k = 1; k < kmax; k++) {
                    s0 = MR(a,0,i,j,k) * MR(p,0,i+1,j,  k)
                       + MR(a,1,i,j,k) * MR(p,0,i,  j+1,k)
                       + MR(a,2,i,j,k) * MR(p,0,i,  j,  k+1)
                       + MR(b,0,i,j,k)
                         * ( MR(p,0,i+1,j+1,k) - MR(p,0,i+1,j-1,k)
                           - MR(p,0,i-1,j+1,k) + MR(p,0,i-1,j-1,k) )
                       + MR(b,1,i,j,k)
                         * ( MR(p,0,i,j+1,k+1) - MR(p,0,i,j-1,k+1)
                           - MR(p,0,i,j+1,k-1) + MR(p,0,i,j-1,k-1) )
                       + MR(b,2,i,j,k)
                         * ( MR(p,0,i+1,j,k+1) - MR(p,0,i-1,j,k+1)
                           - MR(p,0,i+1,j,k-1) + MR(p,0,i-1,j,k-1) )
                       + MR(c,0,i,j,k) * MR(p,0,i-1,j,  k)
                       + MR(c,1,i,j,k) * MR(p,0,i,  j-1,k)
                       + MR(c,2,i,j,k) * MR(p,0,i,  j,  k-1)
                       + MR(wrk1,0,i,j,k);

                    ss = (s0 * MR(a,3,i,j,k) - MR(p,0,i,j,k)) * MR(bnd,0,i,j,k);
                    gosa += ss * ss;
                    MR(wrk2,0,i,j,k) = MR(p,0,i,j,k) + omega * ss;
                }

        for (i = 1; i < imax; i++)
            for (j = 1; j < jmax; j++)
                for (k = 1; k < kmax; k++)
                    MR(p,0,i,j,k) = MR(wrk2,0,i,j,k);
    } /* end n loop */

    return gosa;
}
The problem is, this function seems sequential in nature, since every iteration over nn depends on the previous one. What I tried with MPI was creating an auxiliary variable for gosa (auxgosa) and calling MPI_Reduce after the i/j/k loops, like the following (the root process is rank 0):
// rank is the current process
// size is the total amount of processes
int start = ((imax+1)/size) * rank;
int stop  = ((imax+1)/size) * (rank+1) - 1;
if (rank == 0) { start++; }

for (n = 0; n < nn; n++) {
    gosa = 0.0;
    auxgosa = 0.0;
    for (i = start; i < stop; i++)
        for (j = 1; j < jmax; j++)
            for (k = 1; k < kmax; k++) {
                s0 = MR(aa,0,i,j,k) * MR(pp,0,i+1,j,  k)
                   + MR(aa,1,i,j,k) * MR(pp,0,i,  j+1,k)
                   + MR(aa,2,i,j,k) * MR(pp,0,i,  j,  k+1)
                   + MR(bb,0,i,j,k)
                     * ( MR(pp,0,i+1,j+1,k) - MR(pp,0,i+1,j-1,k)
                       - MR(pp,0,i-1,j+1,k) + MR(pp,0,i-1,j-1,k) )
                   + MR(bb,1,i,j,k)
                     * ( MR(pp,0,i,j+1,k+1) - MR(pp,0,i,j-1,k+1)
                       - MR(pp,0,i,j+1,k-1) + MR(pp,0,i,j-1,k-1) )
                   + MR(bb,2,i,j,k)
                     * ( MR(pp,0,i+1,j,k+1) - MR(pp,0,i-1,j,k+1)
                       - MR(pp,0,i+1,j,k-1) + MR(pp,0,i-1,j,k-1) )
                   + MR(cc,0,i,j,k) * MR(pp,0,i-1,j,  k)
                   + MR(cc,1,i,j,k) * MR(pp,0,i,  j-1,k)
                   + MR(cc,2,i,j,k) * MR(pp,0,i,  j,  k-1)
                   + MR(awrk1,0,i,j,k);

                ss = (s0 * MR(aa,3,i,j,k) - MR(pp,0,i,j,k)) * MR(abnd,0,i,j,k);
                auxgosa += ss * ss;
                MR(awrk2,0,i,j,k) = MR(pp,0,i,j,k) + omega * ss;
            }

    MPI_Reduce(&auxgosa, &gosa, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    for (i = 1; i < imax; i++)
        for (j = 1; j < jmax; j++)
            for (k = 1; k < kmax; k++)
                MR(pp,0,i,j,k) = MR(awrk2,0,i,j,k);
} /* end n loop */
Unfortunately, this didn't work. Could anyone give me some insight into this? I plan on using a similar strategy with OpenMP.
If awrk2 is different from a, p, b, c and wrk1, then there is no loop-carried dependence inside one sweep: only the outer n iterations must run in order.
A simple google search will point you to parallelized versions of the Himeno benchmark (MPI, OpenMP and hybrid MPI+OpenMP versions are available).
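To make the pattern concrete, here is a minimal sketch of the OpenMP version of that structure. This is not the Himeno code itself: the stencil is collapsed to 1-D and `sweep` is an illustrative name. The point is that the outer time-step loop stays sequential while each grid sweep is data-parallel, with gosa handled by a reduction and the update written to a separate array:

```c
#include <stdio.h>

#define NX 32
static float p[NX], wrk[NX];
static const float omega = 0.8f;

/* One Jacobi-style sweep: reads p, writes wrk, so there is no
   loop-carried dependence inside the sweep; gosa is the only shared
   update, and the reduction clause takes care of it. */
float sweep(void) {
    float gosa = 0.0f;
    #pragma omp parallel for reduction(+:gosa)
    for (int i = 1; i < NX - 1; i++) {
        float s0 = 0.5f * (p[i-1] + p[i+1]);  /* simplified stencil */
        float ss = s0 - p[i];
        gosa += ss * ss;
        wrk[i] = p[i] + omega * ss;
    }
    #pragma omp parallel for
    for (int i = 1; i < NX - 1; i++)
        p[i] = wrk[i];  /* copy back, like the p = wrk2 loop */
    return gosa;
}
```

Compiled with -fopenmp the sweep runs in parallel; without it the pragmas are ignored and the result is unchanged, which is a handy way to check correctness.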
So I'm optimizing a loop (as homework) that adds 10,000 elements 600,000 times. The runtime without optimizations is about 23.34 seconds, and my goal is to get under 7 seconds for a B and under 5 seconds for an A.
So I started my optimizations by first unrolling the loop like this:
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
    sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7];
}
This reduces the runtime to about 6.4 seconds (I can get to about 6 if I unroll further).
So I figured I would try adding sub-sums, with a final sum at the end, to save time on read-write dependencies, and I came up with code that looks like this:
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
    sum0 += array[j] + array[j+1];
    sum1 += array[j+2] + array[j+3];
    sum2 += array[j+4] + array[j+5];
    sum3 += array[j+6] + array[j+7];
}
However, this increases the runtime to about 6.8 seconds.
I tried a similar technique using pointers and the best I could do was about 15 seconds.
I only know that the machine I'm running this on (it is a service purchased by the school) is a 32-bit, remote, Intel-based Linux virtual server that I believe is running Red Hat.
I've tried every technique I can think of to speed up the code, but they all seem to have the opposite effect. Could someone elaborate on what I'm doing wrong? Or another technique I could use to lower the runtime? The best the teacher could do was about 4.8 seconds.
As an additional condition I cannot have more than 50 lines of code in the finished project, so doing something complex is likely not possible.
Here is a full copy of both sources:
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;

    // You can add variables between this comment ...
    // double sum0 = 0;
    // double sum1 = 0;
    // double sum2 = 0;
    // double sum3 = 0;
    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - ACTUAL NAME\n");

    for (i = 0; i < N_TIMES; i++) {
        // You can change anything between this comment ...
        int j;
        for (j = 0; j < ARRAY_SIZE; j += 8) {
            sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7];
        }
        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.
    }

    // You can add some final code between this comment ...
    // sum = sum0 + sum1 + sum2 + sum3;
    // ... and this one.
    return 0;
}
Broken-up code:
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;

    // You can add variables between this comment ...
    double sum0 = 0;
    double sum1 = 0;
    double sum2 = 0;
    double sum3 = 0;
    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - ACTUAL NAME\n");

    for (i = 0; i < N_TIMES; i++) {
        // You can change anything between this comment ...
        int j;
        for (j = 0; j < ARRAY_SIZE; j += 8) {
            sum0 += array[j] + array[j+1];
            sum1 += array[j+2] + array[j+3];
            sum2 += array[j+4] + array[j+5];
            sum3 += array[j+6] + array[j+7];
        }
        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.
    }

    // You can add some final code between this comment ...
    sum = sum0 + sum1 + sum2 + sum3;
    // ... and this one.
    return 0;
}
ANSWER
The 'time' application we use to judge the grade is a little bit off. The best I could do was about 4.9 seconds, by unrolling the loop 50 times and grouping it like I did below, using TomKarzes's basic format.
int j;
for (j = 0; j < ARRAY_SIZE; j += 50) {
sum +=(((((((array[j] + array[j+1]) + (array[j+2] + array[j+3])) +
((array[j+4] + array[j+5]) + (array[j+6] + array[j+7]))) +
(((array[j+8] + array[j+9]) + (array[j+10] + array[j+11])) +
((array[j+12] + array[j+13]) + (array[j+14] + array[j+15])))) +
((((array[j+16] + array[j+17]) + (array[j+18] + array[j+19]))))) +
(((((array[j+20] + array[j+21]) + (array[j+22] + array[j+23])) +
((array[j+24] + array[j+25]) + (array[j+26] + array[j+27]))) +
(((array[j+28] + array[j+29]) + (array[j+30] + array[j+31])) +
((array[j+32] + array[j+33]) + (array[j+34] + array[j+35])))) +
((((array[j+36] + array[j+37]) + (array[j+38] + array[j+39])))))) +
((((array[j+40] + array[j+41]) + (array[j+42] + array[j+43])) +
((array[j+44] + array[j+45]) + (array[j+46] + array[j+47]))) +
(array[j+48] + array[j+49])));
}
I experimented with the grouping a bit. On my machine, with my gcc, I found that the following worked best:
for (j = 0; j < ARRAY_SIZE; j += 16) {
sum = sum +
(array[j ] + array[j+ 1]) +
(array[j+ 2] + array[j+ 3]) +
(array[j+ 4] + array[j+ 5]) +
(array[j+ 6] + array[j+ 7]) +
(array[j+ 8] + array[j+ 9]) +
(array[j+10] + array[j+11]) +
(array[j+12] + array[j+13]) +
(array[j+14] + array[j+15]);
}
In other words, it's unrolled 16 times, it groups the sums into pairs, and then it adds the pairs linearly. I also removed the += operator, which affects when sum is first used in the additions.
I found that the measured times varied significantly from one run to the next, even without changing anything, so I suggest timing each version several times before making any conclusions about whether the time has improved or gotten worse.
I'd be interested to know what numbers you get on your machine with this version of the inner loop.
Update: Here's my current fastest version (on my machine, with my compiler):
int j1, j2;
j1 = 0;
do {
j2 = j1 + 20;
sum = sum +
(array[j1 ] + array[j1+ 1]) +
(array[j1+ 2] + array[j1+ 3]) +
(array[j1+ 4] + array[j1+ 5]) +
(array[j1+ 6] + array[j1+ 7]) +
(array[j1+ 8] + array[j1+ 9]) +
(array[j1+10] + array[j1+11]) +
(array[j1+12] + array[j1+13]) +
(array[j1+14] + array[j1+15]) +
(array[j1+16] + array[j1+17]) +
(array[j1+18] + array[j1+19]);
j1 = j2 + 20;
sum = sum +
(array[j2 ] + array[j2+ 1]) +
(array[j2+ 2] + array[j2+ 3]) +
(array[j2+ 4] + array[j2+ 5]) +
(array[j2+ 6] + array[j2+ 7]) +
(array[j2+ 8] + array[j2+ 9]) +
(array[j2+10] + array[j2+11]) +
(array[j2+12] + array[j2+13]) +
(array[j2+14] + array[j2+15]) +
(array[j2+16] + array[j2+17]) +
(array[j2+18] + array[j2+19]);
}
while (j1 < ARRAY_SIZE);
This uses a total unroll amount of 40, split into two groups of 20, with alternating induction variables that are pre-incremented to break dependencies, and a post-tested loop. Again, you can experiment with the parentheses groupings to fine-tune it for your compiler and platform.
I tried your code with the following approaches:
No optimization, for loop with integer indexes by 1, simple sum +=. This took 16.4 seconds on my 64 bit 2011 MacBook Pro.
gcc -O2, same code, got down to 5.46 seconds.
gcc -O3, same code, got down to 5.45 seconds.
I tried using your code with 8-way addition into the sum variable. This took it down to 2.03 seconds.
I doubled that to 16-way addition into the sum variable, which took it down to 1.91 seconds.
I doubled that to 32-way addition into the sum variable. The time WENT UP to 2.08 seconds.
I switched to a pointer approach, as suggested by @kcraigie. With -O3, the time was 6.01 seconds. (Very surprising to me!)
register double * p;
for (p = array; p < array + ARRAY_SIZE; ++p) {
sum += *p;
}
I changed the for loop to a while loop, with sum += *p++ and got the time down to 5.64 seconds.
I changed the while loop to count down instead of up, the time went up to 5.88 seconds.
I changed back to a for loop with an integer index incrementing by 8, added eight register double sum variables (sum0 through sum7), and added _array[j+N] to sumN for each N in [0,7], with _array declared as a register double *const initialized to array, on the chance that it matters. This got the time down to 1.86 seconds.
I changed to a macro that expanded to 10,000 copies of + _array[N], with N a constant. Then I did sum = tnKX(addsum) and the compiler crashed with a segmentation fault. So a pure-inlining approach isn't going to work.
I switched to a macro that expanded to 10,000 copies of sum += _array[N], with N a constant. That ran in 6.63 seconds! Apparently the overhead of loading all that code reduces the effectiveness of the inlining.
I tried declaring a static double _array[ARRAY_SIZE]; and then using __builtin_memcpy to copy it before the first loop. With the 8-way parallel addition, this resulted in a time of 2.96 seconds. I don't think a static array is the way to go. (Sad - I was hoping the constant address would be a winner.)
From all this, it seems like 16-way inlining or 8-way parallel variables should be the way to go. You'll have to try this on your own platform to make sure - I don't know what the wider architecture will do to the numbers.
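For reference, the 8-way parallel-variables version described above might look like the following sketch (sum8 is an illustrative name; in the assignment the body would go between the marker comments instead of in a function). Each accumulator forms its own floating-point dependency chain, so the adds can overlap in the pipeline instead of serializing through one register:

```c
#define ARRAY_SIZE 10000

/* Sum with eight independent accumulators, combined pairwise at the end. */
double sum8(const double *array) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    double s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (int j = 0; j < ARRAY_SIZE; j += 8) {
        s0 += array[j];
        s1 += array[j+1];
        s2 += array[j+2];
        s3 += array[j+3];
        s4 += array[j+4];
        s5 += array[j+5];
        s6 += array[j+6];
        s7 += array[j+7];
    }
    /* pairwise final combine, as in the "final code" section */
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```

Note that this changes the order of the floating-point additions, which the grader here allows but a strict-IEEE setting might not.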
Edit:
Following a suggestion from @pvg, I added this code:
int ntimes = 0;
// ... and this one.
...
// You can change anything between this comment ...
if (ntimes++ == 0) {
Which reduced the run time to < 0.01 seconds. ;-) It's a winner, if you don't get hit with the F-stick.
Original Problem: Problem 1 (INOI 2015)
There are two arrays A[1..N] and B[1..N].
An operation SSum is defined on them as:
SSum[i,j] = A[i] + A[j] + (B[i+1] + B[i+2] + ... + B[j-1]) when i < j
SSum[i,j] = A[i] + A[j] + (B[1] + B[2] + ... + B[j-1]) + (B[i+1] + B[i+2] + ... + B[N]) when i > j
SSum[i,i] = A[i]
The challenge is to find the largest possible value of SSum.
I had an O(n^2) solution based on computing the prefix sums of B:
#include <iostream>
#include <algorithm> // for std::max
int main(){
    int N;
    std::cin >> N;
    int *a = new int[N+1];
    long long int *bPrefixSums = new long long int[N+1];
    for (int iii=1; iii<=N; iii++) // 1-based arrays to prevent confusion
        std::cin >> a[iii];
    bPrefixSums[0] = 0;
    for (int b,iii=1; iii<=N; iii++){
        std::cin >> b;
        bPrefixSums[iii] = bPrefixSums[iii-1] + b;
    }
    long long int SSum, SSumMax = -(1LL<<60); // low enough for any input
    for (int i=1; i <= N; i++)
        for (int j=1; j <= N; j++){
            if (i<j)
                SSum = a[i] + a[j] + (bPrefixSums[j-1] - bPrefixSums[i]);
            else if (i==j)
                SSum = a[i];
            else
                SSum = a[i] + a[j] + ((bPrefixSums[N] - bPrefixSums[i]) + bPrefixSums[j-1]);
            SSumMax = std::max(SSum, SSumMax);
        }
    std::cout << SSumMax;
    return 0;
}
For larger values of N, around 10^6, the program fails to complete within the 3-second time limit.
Since I didn't get enough rep to add a comment, I shall just write the ideas here in this answer.
This problem is really nice, and I was actually inspired by this link. Thanks to @superty.
We can consider this problem in three separate cases: i == j, i < j, and i > j, and take the maximum result over the three.
Consider i == j: the result is just A[i], so the answer for this case is the largest A[i], which is easy to find in O(n) time.
Consider i < j: it's quite similar to the classical maximum-sum problem, and for each j we only need to find the i to its left that makes the result maximum.
Think about the classical problem first: if we are asked to get the maximum partial sum of an array a, we calculate the prefix sums of a to get O(n) complexity. This problem is almost the same.
You can see that here (i < j) we have SSum[i,j] = A[i] + A[j] + (B[i+1] + ... + B[j-1]) = (B[1] + B[2] + ... + B[j-1] + A[j]) - (B[1] + B[2] + ... + B[i] - A[i]). The first term depends only on j, and the second term depends only on i. So the solution is now quite clear: build these two 'prefix sum' arrays and, for each j, subtract the smallest second term seen over all i < j.
Consider i > j: it's quite similar to this discussion on SO (but that discussion doesn't help much).
Similarly, we get SSum[i,j] = A[i] + A[j] + (B[1] + ... + B[j-1]) + (B[i+1] + ... + B[N]) = (B[1] + B[2] + ... + B[j-1] + A[j]) + (A[i] + B[i+1] + ... + B[N-1] + B[N]). The first term again depends only on j (a prefix sum of B plus A[j]), and the second only on i (A[i] plus a suffix sum of B). Build two arrays: ans_left[i], the maximum value of the first term over all j <= i, and ans_right[j], the maximum value of the second term over all i >= j. The answer in this case is the maximum among all (ans_left[i] + ans_right[i+1]), which guarantees j < i.
Finally, the maximum result required for the original problem is the maximum of the answers for these three sub-cases.
It's easy to see that the total complexity is O(n).
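Putting the three cases together, the whole thing might be sketched as follows (maxSSum is an illustrative name, and it uses 0-based indexing, unlike the 1-based statement; reading input and 64-bit overflow margins are left to the caller):

```cpp
#include <vector>
#include <algorithm>

// O(n) maximum SSum over the three cases i == j, i < j, i > j.
// pre[k] = B[0] + ... + B[k-1] (0-based prefix sums of B).
long long maxSSum(const std::vector<long long>& A,
                  const std::vector<long long>& B) {
    int n = (int)A.size();
    std::vector<long long> pre(n + 1, 0);
    for (int k = 0; k < n; k++) pre[k + 1] = pre[k] + B[k];

    // Case i == j: just the largest A[i].
    long long best = *std::max_element(A.begin(), A.end());

    // Case i < j: SSum = (A[j] + pre[j]) - (pre[i+1] - A[i]);
    // keep the minimum of the i-term over all i < j.
    long long minTerm = pre[1] - A[0];
    for (int j = 1; j < n; j++) {
        best = std::max(best, A[j] + pre[j] - minTerm);
        minTerm = std::min(minTerm, pre[j + 1] - A[j]);
    }

    // Case i > j: SSum = (A[j] + pre[j]) + (A[i] + pre[n] - pre[i+1]);
    // keep the maximum of the j-term over all j < i.
    long long maxLeft = A[0] + pre[0];
    for (int i = 1; i < n; i++) {
        best = std::max(best, maxLeft + A[i] + pre[n] - pre[i + 1]);
        maxLeft = std::max(maxLeft, A[i] + pre[i]);
    }
    return best;
}
```

The two running extrema replace the ans_left/ans_right arrays from the description; a single pass per case suffices because each term depends on only one index.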