OpenMP drastically slows down for loop - c

I am attempting to speed up this for loop with OpenMP parallelization. I was under the impression that this should split up the work across a number of threads. However, perhaps the overhead is too large for this to give me any speedup.
I should mention that this loop occurs many many many times, and each instance of the loop should be parallelized. The number of loop iterations, newNx, can be as small as 3 or as large as 256. However, if I conditionally have it parallelized only for newNx > 100 (only the largest loops), it still slows down significantly.
Is there anything in here which would cause this to be slower than anticipated? I should also mention that the vectors A,v,b are VERY large, but access is O(1) I believe.
#pragma omp parallel for private(j,k),shared(A,v,b)
for(i=1;i<=newNx;i+=2) {
for(j=1;j<=newNy;j++) {
for(k=1;k<=newNz;k+=1) {
nynz=newNy*newNz;
v[(i-1)*nynz+(j-1)*newNz+k] =
-(v[(i-1)*nynz+(j-1)*newNz+k+1 - 2*(k/newNz)]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kup+offA] +
v[(i-1)*nynz+(j-1)*newNz+ k-1+2*(1/k)]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + kdo+offA] +
v[(i-1)*nynz+(j - 2*(j/newNy))*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jup+offA] +
v[(i-1)*nynz+(j-2 + 2*(1/j))*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + jdo+offA] +
v[(i - 2*(i/newNx))*nynz+(j-1)*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + iup+offA] +
v[(i-2 + 2*(1/i))*nynz+(j-1)*newNz+k]*A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ido+offA] -
b[(i-1)*nynz + (j-1)*newNz + k])
/A[((i-1)*nynz + (j-1)*newNz + (k-1))*spN + ifi+offA];}}}

Assuming you don't have a race condition you can try fusing the loops. Fusing will give larger chunks to parallelize which will help reduce the effect of false sharing and likely distribute the load better as well.
For a triple loop like this
for(int i2=0; i2<x; i2++) {
for(int j2=0; j2<y; j2++) {
for(int k2=0; k2<z; k2++) {
//
}
}
}
you can fuse it like this
#pragma omp parallel for
for(int n=0; n<(x*y*z); n++) {
int i2 = n/(y*z);
int j2 = (n%(y*z))/z;
int k2 = (n%(y*z))%z;
//
}
In your case you can do it like this
int i, j, k, n;
int x = newNx%2 ? newNx/2+1 : newNx/2;
int y = newNy;
int z = newNz;
#pragma omp parallel for private(i, j, k)
for(n=0; n<(x*y*z); n++) {
i = 2*(n/(y*z)) + 1;
j = (n%(y*z))/z + 1;
k = (n%(y*z))%z + 1;
// rest of code
}
If this successfully speeds up your code then you can feel good that you made your code faster and at the same time obfuscated it even further.
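As a sanity check on the index arithmetic above, here is a minimal sketch that compares the fused loop against the original nest. The function names visit_nested and visit_fused and the visits array are made up for this illustration; they only record which (i, j, k) cells are touched and do not perform the actual stencil update. Each flat index n maps to a distinct cell, so the increments do not race.

```c
#include <stddef.h>

/* Reference: the original loop nest, with i stepping by 2 from 1.
 * visits must hold nx*ny*nz ints; each visited cell is incremented. */
void visit_nested(int nx, int ny, int nz, int *visits)
{
    for (int i = 1; i <= nx; i += 2)
        for (int j = 1; j <= ny; j++)
            for (int k = 1; k <= nz; k++)
                visits[((size_t)(i - 1) * ny + (j - 1)) * nz + (k - 1)] += 1;
}

/* Fused version: one flat index n, from which i, j, k are recovered. */
void visit_fused(int nx, int ny, int nz, int *visits)
{
    const long x = (nx + 1) / 2;      /* number of odd i in 1..nx */
    const long yz = (long)ny * nz;
    const long total = x * yz;
    long n;

    #pragma omp parallel for
    for (n = 0; n < total; n++) {
        const int i = 2 * (int)(n / yz) + 1;
        const int j = (int)((n % yz) / nz) + 1;
        const int k = (int)(n % nz) + 1;   /* equals (n % yz) % nz */
        visits[((size_t)(i - 1) * ny + (j - 1)) * nz + (k - 1)] += 1;
    }
}
```

Without -fopenmp the pragma is ignored and the loop runs serially with the same result. OpenMP 3.0 and later also provide a collapse clause on the loop directive, which asks the compiler to perform this kind of fusion on a perfectly nested loop; whether it helps here would need to be measured.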

Related

Matrix Subset Operations in Array

I have a matrix multiplication problem. We have an image matrix that can have a variable size. I need to calculate C = A*B for every possible nxn submatrix. C is then added to the output image, as shown in the figure. The center point of matrix A lies in the lower triangle, and B is placed diagonally symmetric to A. The A submatrices can overlap, so the B submatrices can overlap too. The figures below show this in more detail:
Blue X points represent all possible midpoints of A. The algorithm should just multiply A by its diagonally mirrored version, called B. I implemented it with a lot of for loops, and I need to reduce the number of loops I use. Could you help me, please?
What kind of algorithm can be used for this problem? Some points still confuse me.
Could you please help me with your genius algorithm talents? Or could you direct me to an expert?
The original question is below:
Thanks.
Update:
#define SIZE_ARRAY 20
#define SIZE_WINDOW 5
#define WINDOW_OFFSET 2
#define INDEX_OFFSET 1
#define START_OFFSET_COLUMN 2
#define START_OFFSET_ROW 3
#define END_OFFSET_COLUMN 3
#define END_OFFSET_ROW 2
#define GET_LOWER_DIAGONAL_INDEX_MIN_ROW (START_OFFSET_ROW)
#define GET_LOWER_DIAGONAL_INDEX_MAX_ROW (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_ROW)
#define GET_LOWER_DIAGONAL_INDEX_MIN_COL (START_OFFSET_COLUMN)
#define GET_LOWER_DIAGONAL_INDEX_MAX_COL (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_COLUMN)
uint32_t lowerDiagonalIndexMinRow = GET_LOWER_DIAGONAL_INDEX_MIN_ROW;
uint32_t lowerDiagonalIndexMaxRow = GET_LOWER_DIAGONAL_INDEX_MAX_ROW;
uint32_t lowerDiagonalIndexMinCol = GET_LOWER_DIAGONAL_INDEX_MIN_COL;
uint32_t lowerDiagonalIndexMaxCol = GET_LOWER_DIAGONAL_INDEX_MAX_COL;
void parallelMultiplication_Stable_Master()
{
startTimeStamp = omp_get_wtime();
#pragma omp parallel for num_threads(8) private(outerIterRow, outerIterCol,rA,cA,rB,cB) shared(inputImage, outputImage)
for(outerIterRow = lowerDiagonalIndexMinRow; outerIterRow < lowerDiagonalIndexMaxRow; outerIterRow++)
{
for(outerIterCol = lowerDiagonalIndexMinCol; outerIterCol < lowerDiagonalIndexMaxCol; outerIterCol++)
{
if(outerIterCol + 1 < outerIterRow)
{
rA = outerIterRow - WINDOW_OFFSET;
cA = outerIterCol - WINDOW_OFFSET;
rB = outerIterCol - WINDOW_OFFSET;
cB = outerIterRow - WINDOW_OFFSET;
for(i= outerIterRow - WINDOW_OFFSET; i <= outerIterRow + WINDOW_OFFSET; i++)
{
for(j= outerIterCol - WINDOW_OFFSET; j <= outerIterCol + WINDOW_OFFSET; j++)
{
for(k=0; k < SIZE_WINDOW; k++)
{
#pragma omp critical
outputImage[i][j] += inputImage[rA][cA+k] * inputImage[rB+k][cB];
}
cB++;
rA++;
}
rB++;
cA++;
printf("Thread Number - %d",omp_get_thread_num());
}
}
}
}
stopTimeStamp = omp_get_wtime();
printArray(outputImage,"Output Image");
printConsoleNotification(100, startTimeStamp, stopTimeStamp);
}
I get a segmentation fault if I set the thread count to more than 1. What is the trick?
I'm not providing a solution, but some thoughts that may help the OP explore a possible approach.
You can evaluate each element of the resulting C matrix directly, from the values of the original matrix in a way similar to a convolution operation.
Consider the following image (sorry if it's confusing):
Instead of computing each matrix product for every A submatrix, you can evaluate the value of each C_{i,j} from the values in the shaded areas.
Note that C_{i,j} depends only on a small subset of row i and that the elements of the upper right triangular submatrix (where the B submatrices are picked) could be copied and maybe transposed into a more cache-friendly arrangement.
Alternatively, it may be worth exploring an approach where for every possible B_{i,j}, all the corresponding elements of C are evaluated.
Edit
Note that you can actually save a lot of calculations (and maybe cache misses) by grouping the terms, see e.g. the first two elements of row i in A:
More formally
C_{i,j} = A_{i,j-4} · (B_{j-4,i} + B_{j-4,i+1} + B_{j-4,i+2} + B_{j-4,i+3} + B_{j-4,i+4})
C_{i,j} += A_{i,j-3} · (B_{j-3,i-1} + B_{j-3,i+4} + 2·(B_{j-3,i} + B_{j-3,i+1} + B_{j-3,i+2} + B_{j-3,i+3}))
C_{i,j} += A_{i,j-2} · (B_{j-2,i-2} + B_{j-2,i+4} + 2·(B_{j-2,i-1} + B_{j-2,i+3}) + 3·(B_{j-2,i} + B_{j-2,i+1} + B_{j-2,i+2}))
C_{i,j} += A_{i,j-1} · (B_{j-1,i-3} + B_{j-1,i+4} + 2·(B_{j-1,i-2} + B_{j-1,i+3}) + 3·(B_{j-1,i-1} + B_{j-1,i+2}) + 4·(B_{j-1,i} + B_{j-1,i+1}))
C_{i,j} += A_{i,j} · (B_{j,i-4} + B_{j,i+4} + 2·(B_{j,i-3} + B_{j,i+3}) + 3·(B_{j,i-2} + B_{j,i+2}) + 4·(B_{j,i-1} + B_{j,i+1}) + 5·B_{j,i})
C_{i,j} += A_{i,j+1} · (B_{j+1,i-4} + B_{j+1,i+3} + 2·(B_{j+1,i-3} + B_{j+1,i+2}) + 3·(B_{j+1,i-2} + B_{j+1,i+1}) + 4·(B_{j+1,i-1} + B_{j+1,i}))
C_{i,j} += A_{i,j+2} · (B_{j+2,i-4} + B_{j+2,i+2} + 2·(B_{j+2,i-3} + B_{j+2,i+1}) + 3·(B_{j+2,i-2} + B_{j+2,i-1} + B_{j+2,i}))
C_{i,j} += A_{i,j+3} · (B_{j+3,i-4} + B_{j+3,i+1} + 2·(B_{j+3,i-3} + B_{j+3,i-2} + B_{j+3,i-1} + B_{j+3,i}))
C_{i,j} += A_{i,j+4} · (B_{j+4,i-4} + B_{j+4,i-3} + B_{j+4,i-2} + B_{j+4,i-1} + B_{j+4,i})
If I estimated correctly, this requires something like 60 additions and 25 (possibly fused) multiplications, compared to 125 operations like C_{i,j} += A_{i,k} · B_{k,i} spread all over the place.
I think that cache-locality may have a bigger impact on performance than the mere reduction of operations.
We could also precompute all the values
S_{j,i} = B_{j,i} + B_{j,i+1} + B_{j,i+2} + B_{j,i+3} + B_{j,i+4}
Then the previous formulas become
C_{i,j} = A_{i,j-4} · S_{j-4,i}
C_{i,j} += A_{i,j-3} · (S_{j-3,i-1} + S_{j-3,i})
C_{i,j} += A_{i,j-2} · (S_{j-2,i-2} + S_{j-2,i-1} + S_{j-2,i})
C_{i,j} += A_{i,j-1} · (S_{j-1,i-3} + S_{j-1,i-2} + S_{j-1,i-1} + S_{j-1,i})
C_{i,j} += A_{i,j} · (S_{j,i-4} + S_{j,i-3} + S_{j,i-2} + S_{j,i-1} + S_{j,i})
C_{i,j} += A_{i,j+1} · (S_{j+1,i-4} + S_{j+1,i-3} + S_{j+1,i-2} + S_{j+1,i-1})
C_{i,j} += A_{i,j+2} · (S_{j+2,i-4} + S_{j+2,i-3} + S_{j+2,i-2})
C_{i,j} += A_{i,j+3} · (S_{j+3,i-4} + S_{j+3,i-3})
C_{i,j} += A_{i,j+4} · S_{j+4,i-4}
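The precomputation of the S values can be sketched as a sliding-window sum along each row. This is only an illustration under assumptions: the function name window_row_sums, the row-major layout, and the window width parameter w (5 in the formulas above) are not from the original post, and the routine assumes cols >= w.

```c
#include <stddef.h>

/* For every row r, computes s[r][c] = b[r][c] + ... + b[r][c + w - 1]
 * for 0 <= c <= cols - w. Both b and s are row-major; s has
 * rows * (cols - w + 1) entries. Assumes cols >= w. */
void window_row_sums(const float *b, int rows, int cols, int w, float *s)
{
    const int out_cols = cols - w + 1;
    for (int r = 0; r < rows; ++r) {
        const float *row = b + (size_t)r * cols;
        float *out = s + (size_t)r * out_cols;
        float sum = 0.0f;

        /* First window is summed directly. */
        for (int c = 0; c < w; ++c)
            sum += row[c];
        out[0] = sum;

        /* Each later window adds the entering element and drops the
         * leaving one, so the cost per output is constant. */
        for (int c = 1; c < out_cols; ++c) {
            sum += row[c + w - 1] - row[c - 1];
            out[c] = sum;
        }
    }
}
```

The running update trades a little floating-point accuracy for speed, since rounding errors can accumulate along a row; summing each window from scratch avoids that at five additions per output.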
Here is my take. I wrote this before OP showed any code, so I'm not following any of their code patterns.
I start with a suitable image struct, just for my own sanity.
struct Image
{
float* values;
int rows, cols;
};
struct Image image_allocate(int rows, int cols)
{
struct Image rtrn;
rtrn.rows = rows;
rtrn.cols = cols;
rtrn.values = malloc(sizeof(float) * rows * cols);
return rtrn;
}
void image_fill(struct Image* img)
{
ptrdiff_t row, col;
for(row = 0; row < img->rows; ++row)
for(col = 0; col < img->cols; ++col)
img->values[row * img->cols + col] = rand() * (1.f / RAND_MAX);
}
void image_print(const struct Image* img)
{
ptrdiff_t row, col;
for(row = 0; row < img->rows; ++row) {
for(col = 0; col < img->cols; ++col)
printf("%.3f ", img->values[row * img->cols + col]);
putchar('\n');
}
putchar('\n');
}
A 5x5 matrix multiplication is too small to reasonably dispatch to BLAS. So I write a simple version myself that can be loop-unrolled and / or inlined. This routine could use a couple of micro-optimizations but let's keep it simple for now.
/** out += left * right for 5x5 sub-matrices */
static void mat_mul_5x5(
float* restrict out, const float* left, const float* right, int cols)
{
ptrdiff_t row, col, inner;
float sum;
for(row = 0; row < 5; ++row) {
for(col = 0; col < 5; ++col) {
sum = out[row * cols + col];
for(inner = 0; inner < 5; ++inner)
sum += left[row * cols + inner] * right[inner * cols + col];
out[row * cols + col] = sum;
}
}
}
Now for the single-threaded implementation of the main algorithm. Again, nothing fancy. We just iterate over the lower triangular matrix, excluding the diagonal. I keep track of the top-left corner instead of the center point. Makes index computation a bit simpler.
void compute_ltr(struct Image* restrict out, const struct Image* in)
{
ptrdiff_t top, left, end;
/* if image is not square, find square subset */
end = out->rows < out->cols ? out->rows : out->cols;
assert(in->rows == out->rows && in->cols == out->cols);
memset(out->values, 0, sizeof(float) * out->rows * out->cols);
for(top = 1; top <= end - 5; ++top)
for(left = 0; left < top; ++left)
mat_mul_5x5(out->values + top * out->cols + left,
in->values + top * in->cols + left,
in->values + left * in->cols + top,
in->cols);
}
The parallelization is a bit tricky because we have to make sure the threads don't overlap in their output matrices. A critical section, atomics or similar stuff would cost too much performance.
A simpler solution is a strided approach: If we always keep the threads 5 rows apart, they cannot interfere. So we simply compute every fifth row, synchronize all threads, then compute the next set of rows, five apart, and so on.
void compute_ltr_parallel(struct Image* restrict out, const struct Image* in)
{
/* if image is not square, find square subset */
const ptrdiff_t end = out->rows < out->cols ? out->rows : out->cols;
assert(in->rows == out->rows && in->cols == out->cols);
memset(out->values, 0, sizeof(float) * out->rows * out->cols);
/*
* Keep the parallel section open for multiple loops to reduce
* overhead
*/
# pragma omp parallel
{
ptrdiff_t top, left, offset;
for(offset = 0; offset < 5; ++offset) {
/* Use dynamic scheduling because the work per row varies */
# pragma omp for schedule(dynamic)
for(top = 1 + offset; top <= end - 5; top += 5)
for(left = 0; left < top; ++left)
mat_mul_5x5(out->values + top * out->cols + left,
in->values + top * in->cols + left,
in->values + left * in->cols + top,
in->cols);
}
}
}
My benchmark with 1000 iterations of a 1000x1000 image shows 7 seconds for the serial version and 1.2 seconds for the parallelized version on my 8 core / 16 thread CPU.
EDIT for completeness: Here are the includes and the main for benchmarking.
#include <assert.h>
#include <stddef.h>
/* using ptrdiff_t */
#include <stdlib.h>
/* using malloc */
#include <stdio.h>
/* using printf */
#include <string.h>
/* using memset */
/* Insert code from above here */
int main()
{
int rows = 1000, cols = 1000, rep = 1000;
struct Image in, out;
in = image_allocate(rows, cols);
out = image_allocate(rows, cols);
image_fill(&in);
# if 1
do
compute_ltr_parallel(&out, &in);
while(--rep);
# else
do
compute_ltr(&out, &in);
while(--rep);
# endif
}
Compile with gcc -O3 -fopenmp.
Regarding the comment, and also your way of using OpenMP: Don't overcomplicate things with unnecessary directives. OpenMP can figure out how many threads are available itself. And private variables can easily be declared within the parallel section (usually).
If you want a specific number of threads, just call with the appropriate environment variable, e.g. on Linux call OMP_NUM_THREADS=8 ./executable
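To illustrate the point about declaring private variables inside the parallel section, here is a minimal sketch. The function row_sums is a made-up example, not part of the code above. In the question's code the inner counters i, j and k do not appear in the private clause; if they are declared outside the parallel region they are shared between threads, which is a plausible source of out-of-range indices. Declaring such variables inside the loop body avoids having to list them at all.

```c
#include <stddef.h>

/* Writes the sum of each row of a rows x cols row-major matrix
 * into out[row]. */
void row_sums(const float *values, int rows, int cols, float *out)
{
    #pragma omp parallel for
    for (int row = 0; row < rows; ++row) {
        /* Declared inside the loop body, so every thread gets its
         * own copy without a private(...) clause. */
        float sum = 0.0f;
        for (int col = 0; col < cols; ++col)
            sum += values[(size_t)row * cols + col];
        out[row] = sum;   /* each row is written by exactly one thread */
    }
}
```

No num_threads clause is given, so the thread count comes from the runtime default or from OMP_NUM_THREADS.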

Accurate method for finding the time complexity of a function

How to find the time complexity of this function:
Code
void f(int n)
{
for(int i=0; i<n; ++i)
for(int j=0; j<i; ++j)
for(int k=i*j; k>0; k/=2)
printf("~");
}
I took an educated guess of (n^2)*log(n) based on intuition and it turned out to be correct.
But I can't seem to find an accurate explanation for it.
For every value of i with i > 0, the middle loop yields i-1 values of j for which the innermost loop actually runs (j = 0 gives k = 0, so nothing is printed). For these, k starts respectively at:
i*1, i*2, ..., i(i-1)
Since k is halved until it reaches 0, each of these innermost loops requires about lg(k) steps. Hence
lg(i*1) + lg(i*2) + ... + lg(i(i-1)) = lg(i) + lg(i) + lg(2) + ... + lg(i) + lg(i-1)
= (i-1)lg(i) + lg(2) + ... + lg(i-1)
Therefore the total would be
f(n) ::= sum_{i=1}^{n-1} [(i-1)*lg(i) + lg(2) + ... + lg(i-1)]
Let's now bound f(n+1) from above:
f(n+1) <= sum_{i=1}^n [i*lg(i) + (i-1)*lg(i-1)]
<= 2*sum_{i=1}^n i*lg(i)
<= C*integral_0^n x(ln x) ; integral bound, some constant C
= C/2(n^2(ln n) - n^2/2) ; integral x*ln(x) = x^2/2*ln(x) - x^2/4
= O(n^2*lg(n))
If we now bound f(n+1) from below:
f(n+1) >= sum_{i=1}^n (i-1)*lg(i)
>= C*integral_0^n x(ln x) ; integral bound, some constant C > 0
= C*(n^2*ln(n)/2 - n^2/4) ; integral x*ln(x) = x^2/2*ln(x) - x^2/4
>= C/4(n^2*ln(n)) ; for n >= e
= Omega(n^2*lg(n))
The upper and lower bounds together give f(n) = Theta(n^2*lg(n)).
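The bound can also be checked empirically. The sketch below counts how often the innermost statement of f runs instead of printing, and divides by n^2*lg(n). The helper names count_steps, floor_lg and growth_ratio are made up for this illustration, and floor_lg is exact only when n is a power of two.

```c
/* Counts how many times the innermost statement of f(n) would run. */
unsigned long long count_steps(int n)
{
    unsigned long long steps = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < i; ++j)
            for (long long k = (long long)i * j; k > 0; k /= 2)
                ++steps;
    return steps;
}

/* floor(lg(n)) for n >= 1, using integer arithmetic only. */
static int floor_lg(unsigned long long n)
{
    int lg = 0;
    while (n > 1) {
        n /= 2;
        ++lg;
    }
    return lg;
}

/* Ratio steps / (n^2 * lg n) for n >= 2; it should stay bounded
 * as n grows if the Theta(n^2 * lg n) estimate is right. */
double growth_ratio(int n)
{
    return (double)count_steps(n) / ((double)n * n * floor_lg((unsigned long long)n));
}
```

Calling growth_ratio for n = 256, 512, 1024 and watching the values stay between fixed constants is consistent with the analysis, although it proves nothing by itself.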

Loop Splitting makes code slower

So I'm optimizing a loop (as homework) that adds 10,000 elements 600,000 times. The time without optimizations is 23.34s~ and my goal is to reach less than 7 seconds for a B and less than 5 for an A.
So I started my optimizations by first unrolling the loop like this.
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7];
This reduces the runtime to about 6.4~ seconds (I can hit about 6 if I unroll further).
So I figured I would try adding sub-sums and making a final sum at the end to save time on read-write dependencies and I came up with code that looks like this.
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum0 += array[j] + array[j+1];
sum1 += array[j+2] + array[j+3];
sum2 += array[j+4] + array[j+5];
sum3 += array[j+6] + array[j+7];
However this increases the runtime to about 6.8 seconds
I tried a similar technique using pointers and the best I could do was about 15 seconds.
I only know that the machine I'm running this on (as it is a service purchased by the school) is a 32 bit, remote, Intel based, Linux virtual server that I believe is running Red Hat.
I've tried every technique I can think of to speed up the code, but they all seem to have the opposite effect. Could someone elaborate on what I'm doing wrong? Or another technique I could use to lower the runtime? The best the teacher could do was about 4.8 seconds.
As an additional condition I cannot have more than 50 lines of code in the finished project, so doing something complex is likely not possible.
Here is a full copy of both sources
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
// double sum0 = 0;
// double sum1 = 0;
// double sum2 = 0;
// double sum3 = 0;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - ACTUAL NAME\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7];
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
// sum = sum0 + sum1 + sum2 + sum3;
// ... and this one.
return 0;
}
Broken up code
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
double sum0 = 0;
double sum1 = 0;
double sum2 = 0;
double sum3 = 0;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - ACTUAL NAME\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j += 8) {
sum0 += array[j] + array[j+1];
sum1 += array[j+2] + array[j+3];
sum2 += array[j+4] + array[j+5];
sum3 += array[j+6] + array[j+7];
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
sum = sum0 + sum1 + sum2 + sum3;
// ... and this one.
return 0;
}
ANSWER
The 'time' application we use to judge the grade is a little bit off. The best I could do was 4.9~ by unrolling the loop 50 times and grouping it like I did below using TomKarzes's basic format.
int j;
for (j = 0; j < ARRAY_SIZE; j += 50) {
sum +=(((((((array[j] + array[j+1]) + (array[j+2] + array[j+3])) +
((array[j+4] + array[j+5]) + (array[j+6] + array[j+7]))) +
(((array[j+8] + array[j+9]) + (array[j+10] + array[j+11])) +
((array[j+12] + array[j+13]) + (array[j+14] + array[j+15])))) +
((((array[j+16] + array[j+17]) + (array[j+18] + array[j+19]))))) +
(((((array[j+20] + array[j+21]) + (array[j+22] + array[j+23])) +
((array[j+24] + array[j+25]) + (array[j+26] + array[j+27]))) +
(((array[j+28] + array[j+29]) + (array[j+30] + array[j+31])) +
((array[j+32] + array[j+33]) + (array[j+34] + array[j+35])))) +
((((array[j+36] + array[j+37]) + (array[j+38] + array[j+39])))))) +
((((array[j+40] + array[j+41]) + (array[j+42] + array[j+43])) +
((array[j+44] + array[j+45]) + (array[j+46] + array[j+47]))) +
(array[j+48] + array[j+49])));
}
I experimented with the grouping a bit. On my machine, with my gcc, I found that the following worked best:
for (j = 0; j < ARRAY_SIZE; j += 16) {
sum = sum +
(array[j ] + array[j+ 1]) +
(array[j+ 2] + array[j+ 3]) +
(array[j+ 4] + array[j+ 5]) +
(array[j+ 6] + array[j+ 7]) +
(array[j+ 8] + array[j+ 9]) +
(array[j+10] + array[j+11]) +
(array[j+12] + array[j+13]) +
(array[j+14] + array[j+15]);
}
In other words, it's unrolled 16 times, it groups the sums into pairs, and then it adds the pairs linearly. I also removed the += operator, which affects when sum is first used in the additions.
I found that the measured times varied significantly from one run to the next, even without changing anything, so I suggest timing each version several times before making any conclusions about whether the time has improved or gotten worse.
I'd be interested to know what numbers you get on your machine with this version of the inner loop.
Update: Here's my current fastest version (on my machine, with my compiler):
int j1, j2;
j1 = 0;
do {
j2 = j1 + 20;
sum = sum +
(array[j1 ] + array[j1+ 1]) +
(array[j1+ 2] + array[j1+ 3]) +
(array[j1+ 4] + array[j1+ 5]) +
(array[j1+ 6] + array[j1+ 7]) +
(array[j1+ 8] + array[j1+ 9]) +
(array[j1+10] + array[j1+11]) +
(array[j1+12] + array[j1+13]) +
(array[j1+14] + array[j1+15]) +
(array[j1+16] + array[j1+17]) +
(array[j1+18] + array[j1+19]);
j1 = j2 + 20;
sum = sum +
(array[j2 ] + array[j2+ 1]) +
(array[j2+ 2] + array[j2+ 3]) +
(array[j2+ 4] + array[j2+ 5]) +
(array[j2+ 6] + array[j2+ 7]) +
(array[j2+ 8] + array[j2+ 9]) +
(array[j2+10] + array[j2+11]) +
(array[j2+12] + array[j2+13]) +
(array[j2+14] + array[j2+15]) +
(array[j2+16] + array[j2+17]) +
(array[j2+18] + array[j2+19]);
}
while (j1 < ARRAY_SIZE);
This uses a total unroll amount of 40, split into two groups of 20, with alternating induction variables that are pre-incremented to break dependencies, and a post-tested loop. Again, you can experiment with the parentheses groupings to fine-tune it for your compiler and platform.
I tried your code with the following approaches:
No optimization, for loop with integer indexes by 1, simple sum +=. This took 16.4 seconds on my 64 bit 2011 MacBook Pro.
gcc -O2, same code, got down to 5.46 seconds.
gcc -O3, same code, got down to 5.45 seconds.
I tried using your code with 8-way addition into the sum variable. This took it down to 2.03 seconds.
I doubled that to 16-way addition into the sum variable, this took it down to 1.91 seconds.
I doubled that to 32-way addition into the sum variable. The time WENT UP to 2.08 seconds.
I switched to a pointer approach, as suggested by @kcraigie. With -O3, the time was 6.01 seconds. (Very surprising to me!)
register double * p;
for (p = array; p < array + ARRAY_SIZE; ++p) {
sum += *p;
}
I changed the for loop to a while loop, with sum += *p++ and got the time down to 5.64 seconds.
I changed the while loop to count down instead of up, the time went up to 5.88 seconds.
I changed back to a for loop with incrementing-by-8 integer index, added 8 register double sum[0-7] variables, and added _array[j+N] to sumN for N in [0,7]. With _array declared to be a register double *const initialized to array, on the chance that it matters. This got the time down to 1.86 seconds.
I changed to a macro that expanded to 10,000 copies of +_array[n], with N a constant. Then I did sum = tnKX(addsum) and the compiler crashed with a segmentation fault. So a pure-inlining approach isn't going to work.
I switched to a macro that expanded to 10,000 copies of sum += _array[n] with N a constant. That ran in 6.63 seconds!! Apparently the overhead of loading all that code reduces the effectiveness of the inlining.
I tried declaring a static double _array[ARRAY_SIZE]; and then using __builtin_memcpy to copy it before the first loop. With the 8-way parallel addition, this resulted in a time of 2.96 seconds. I don't think static array is the way to go. (Sad - I was hoping the constant address would be a winner.)
From all this, it seems like 16-way inlining or 8-way parallel variables should be the way to go. You'll have to try this on your own platform to make sure - I don't know what the wider architecture will do to the numbers.
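The "8-way parallel variables" variant is described above only in words, so here is a minimal sketch of what it might look like. The function name sum_8way and the accumulator names s0..s7 are assumptions; the original used register variables inside the assignment's loop rather than a separate function. The sketch assumes n is a multiple of 8, which holds for ARRAY_SIZE = 10000.

```c
#include <stddef.h>

/* Sums n doubles using eight independent accumulators, so that
 * consecutive additions do not depend on each other's result.
 * Assumes n is a multiple of 8. */
double sum_8way(const double *array, size_t n)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    double s4 = 0, s5 = 0, s6 = 0, s7 = 0;

    for (size_t j = 0; j < n; j += 8) {
        s0 += array[j];
        s1 += array[j + 1];
        s2 += array[j + 2];
        s3 += array[j + 3];
        s4 += array[j + 4];
        s5 += array[j + 5];
        s6 += array[j + 6];
        s7 += array[j + 7];
    }
    /* Combine the partial sums pairwise at the end. */
    return ((s0 + s1) + (s2 + s3)) + ((s4 + s5) + (s6 + s7));
}
```

Reordering floating-point additions like this can change the rounding of the result in general; for the all-zero array produced by calloc in the assignment it makes no difference.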
Edit:
Following a suggestion from @pvg, I added this code:
int ntimes = 0;
// ... and this one.
...
// You can change anything between this comment ...
if (ntimes++ == 0) {
Which reduced the run time to < 0.01 seconds. ;-) It's a winner, if you don't get hit with the F-stick.

Largest Slice Sum from Two Different Arrays

Original Problem: Problem 1 (INOI 2015)
There are two arrays A[1..N] and B[1..N]
An operation SSum is defined on them as
SSum[i,j] = A[i] + A[j] + B[t (where t = i+1, i+2, ..., j-1)] when i < j
SSum[i,j] = A[i] + A[j] + B[t (where t = 1, 2, ..., j-1, i+1, i+2, ..., N)] when i > j
SSum[i,i] = A[i]
The challenge is to find the largest possible value of SSum.
I had an O(n^2) solution based on computing the Prefix Sums of B
#include <algorithm>
#include <iostream>
#include <utility>
int main(){
int N;
std::cin >> N;
int *a = new int[N+1];
long long int *bPrefixSums = new long long int[N+1];
for (int iii=1; iii<=N; iii++) //1-based arrays to prevent confusion
std::cin >> a[iii];
bPrefixSums[0] = 0;
for (int b,iii=1; iii<=N; iii++){
std::cin >> b;
bPrefixSums[iii] = bPrefixSums[iii-1] + b;
}
long long int SSum, SSumMax=-(1<<10);
for (int i=1; i <= N; i++)
for (int j=1; j <= N; j++){
if (i<j)
SSum = a[i] + a[j] + (bPrefixSums[j-1] - bPrefixSums[i]);
else if (i==j)
SSum = a[i];
else
SSum = a[i] + a[j] + ((bPrefixSums[N] - bPrefixSums[i]) + bPrefixSums[j-1]);
SSumMax = std::max(SSum, SSumMax);
}
std::cout << SSumMax;
return 0;
}
For larger values of N around 10^6, the program fails to complete the task in 3 seconds.
Since I didn't get enough rep to add a comment, I shall just write the ideas here in this answer.
This problem is really nice, and I was actually inspired by this link. Thanks to @superty.
We can split this problem into three cases: i == j, i < j, and i > j. We then only need to find the maximum result in each case.
Consider i == j: The maximum result should be a[i], and it's easy to find the answer in O(n) time complexity.
Consider i < j: It's quite similar to the classical maximum sum problem, and for each j we only need to find the i to its left that maximizes the result.
Think about the classical problem first: if we are asked to get the maximum partial sum for array a, we calculate the prefix sum of a in order to get an O(n) complexity. In this problem it is almost the same.
You can see that here (i < j) we have SSum[i,j] = A[i] + A[j] + B[t (where t = i+1, i+2, ..., j-1)] = (B[1] + B[2] + ... + B[j - 1] + A[j]) - (B[1] + B[2] + ... + B[i] - A[i]). The first term depends only on j and the second term depends only on i. So the solution is now quite clear: build two 'prefix sums' and, for each prefix_sum_1[j], find the smallest prefix_sum_2[i] with i < j.
Consider i > j: It's quite similar to this discussion on SO (but this discussion doesn't help much).
Similarly, we get SSum[i,j] = A[i] + A[j] + B[t (where t = 1, 2, ..., j-1, i+1, i+2, ..., N)] = (B[1] + B[2] + ... + B[j - 1] + A[j]) + (A[i] + B[i + 1] + ... + B[n - 1] + B[n]). Now you need both the prefix sum and the suffix sum of the array (here prefix_sum[i] = prefix_sum[i - 1] - a[i - 1] + b[i - 1] + a[i], and similarly for the suffix). Then build two more arrays: ans_left[i], the maximum value of the first term over all j <= i, and ans_right[j], the maximum value of the second term over all i >= j. The answer for this case is the maximum of ans_left[i] + ans_right[i + 1] over all i.
Finally, the maximum result required for the original problem is the maximum of the answers for these three sub-cases.
It's clear to see that the total complexity is O(n).
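The three cases can be combined into one pass. The sketch below is one possible way to do that, not the answerer's code: instead of the ans_left and ans_right arrays it keeps two running maxima, and it rewrites the i > j case as (A[j] + B[1..j-1]) + (A[i] - B[1..i]) + (B[1] + ... + B[N]), which equals the suffix form above. The function name max_ssum and the 1-based array layout (index 0 unused) are assumptions, and n >= 1 is required.

```c
/* Largest SSum[i,j] over all pairs; a and b are 1-based arrays of
 * n elements each (index 0 is unused). Requires n >= 1. */
long long max_ssum(const int *a, const int *b, int n)
{
    long long total = 0;             /* B[1] + ... + B[n] */
    for (int t = 1; t <= n; ++t)
        total += b[t];

    long long best = a[1];
    long long pre = 0;               /* B[1] + ... + B[k-1] */
    long long max_first = 0;         /* max of A[m] + B[1..m-1] for m < k */
    long long max_second = 0;        /* max of A[m] - B[1..m]   for m < k */

    for (int k = 1; k <= n; ++k) {
        const long long first = a[k] + pre;
        const long long second = a[k] - (pre + b[k]);

        if (a[k] > best)             /* case i == j */
            best = a[k];
        if (k > 1) {
            /* case i < j with j = k: first(j) + best second(i), i < j */
            if (first + max_second > best)
                best = first + max_second;
            /* case i > j with i = k: second(i) + total + best first(j) */
            if (second + total + max_first > best)
                best = second + total + max_first;
        }
        if (k == 1 || first > max_first)
            max_first = first;
        if (k == 1 || second > max_second)
            max_second = second;
        pre += b[k];
    }
    return best;
}
```

For A = 2 3 2 3 1 and B = 3 4 4 6 3 the function returns 18, which is SSum[2,1] = A[2] + A[1] + B[3] + B[4] + B[5].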

How to use Armadillo Columns/Rows to perform optimised calculations on accesses within the same column

What is the best way to manipulate indexing in Armadillo? I was under the impression that it heavily used template expressions to avoid temporaries, but I'm not seeing these speedups.
Is direct array indexing still the best way to approach calculations that rely on consecutive elements within the same array?
Keep in mind, that I hope to parallelise these calculations in the future with TBB::parallel_for (In this case, from a maintainability perspective, it may be simpler to use direct accessing?) These calculations happen in a tight loop, and I hope to make them as optimal as possible.
ElapsedTimer timer;
int n = 768000;
int numberOfLoops = 5000;
arma::Col<double> directAccess1(n);
arma::Col<double> directAccess2(n);
arma::Col<double> directAccessResult1(n);
arma::Col<double> directAccessResult2(n);
arma::Col<double> armaAccess1(n);
arma::Col<double> armaAccess2(n);
arma::Col<double> armaAccessResult1(n);
arma::Col<double> armaAccessResult2(n);
std::valarray<double> valArrayAccess1(n);
std::valarray<double> valArrayAccess2(n);
std::valarray<double> valArrayAccessResult1(n);
std::valarray<double> valArrayAccessResult2(n);
// Prefil
for (int i = 0; i < n; i++) {
directAccess1[i] = i;
directAccess2[i] = n - i;
armaAccess1[i] = i;
armaAccess2[i] = n - i;
valArrayAccess1[i] = i;
valArrayAccess2[i] = n - i;
}
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
directAccessResult1[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i - 1]) * directAccess2[i - 1];
directAccessResult2[i] = -directAccess1[i] / (directAccess1[i] + directAccess1[i]) * directAccess2[i];
}
}
timer.StopAndPrint("Direct Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
armaAccessResult1.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(0, n - 2)) % armaAccess2.rows(0, n - 2);
armaAccessResult2.rows(1, n - 1) = -armaAccess1.rows(1, n - 1) / (armaAccess1.rows(1, n - 1) + armaAccess1.rows(1, n - 1)) % armaAccess2.rows(1, n - 1);
}
timer.StopAndPrint("Arma Array Indexing Took");
std::cout << std::endl;
timer.Start();
for (int j = 0; j < numberOfLoops; j++) {
for (int i = 1; i < n; i++) {
valArrayAccessResult1[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i - 1]) * valArrayAccess2[i - 1];
valArrayAccessResult2[i] = -valArrayAccess1[i] / (valArrayAccess1[i] + valArrayAccess1[i]) * valArrayAccess2[i];
}
}
timer.StopAndPrint("Valarray Array Indexing Took:");
std::cout << std::endl;
In VS release mode (/O2 - to avoid armadillo array indexing checks), they produce the following timings:
Started Performance Analysis!
Direct Array Indexing Took: 37.294 seconds elapsed
Arma Array Indexing Took: 39.4292 seconds elapsed
Valarray Array Indexing Took:: 37.2354 seconds elapsed
Your direct code is already quite optimal, so expression templates are not going to help here.
However, you may want to make sure the optimization level in your compiler actually enables auto-vectorization (-O3 in gcc). Secondly, you can get a bit of extra speed by #define ARMA_NO_DEBUG before including the Armadillo header. This will turn off all run-time checks (such as bound checks for element access), but this is not recommended until you have completely debugged your program.

Resources