Matrix Subset Operations in an Array - C
I have a matrix multiplication problem. We have an image matrix of variable size. For every possible n×n submatrix A, C = A*B has to be calculated, and each C is accumulated into the output image as shown in the figure. The center point of A is located in the lower triangle, and B is placed diagonally symmetric to A. The A windows can overlap, so the B windows can overlap too. The figures below illustrate this in more detail:
Blue X points represent all possible center points of A. The algorithm should simply multiply A by its diagonally mirrored version, which I call B. I implemented this with lots of nested for loops, and I need to reduce the number of loops. Could you help me, please?
What kind of algorithm can be used for this problem? There are a few points that confuse me.
Could you please help me with your algorithmic talents, or direct me to an expert?
Thanks.
Update: Here is my current code.
#define SIZE_ARRAY 20
#define SIZE_WINDOW 5
#define WINDOW_OFFSET 2
#define INDEX_OFFSET 1
#define START_OFFSET_COLUMN 2
#define START_OFFSET_ROW 3
#define END_OFFSET_COLUMN 3
#define END_OFFSET_ROW 2
#define GET_LOWER_DIAGONAL_INDEX_MIN_ROW (START_OFFSET_ROW)
#define GET_LOWER_DIAGONAL_INDEX_MAX_ROW (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_ROW)
#define GET_LOWER_DIAGONAL_INDEX_MIN_COL (START_OFFSET_COLUMN)
#define GET_LOWER_DIAGONAL_INDEX_MAX_COL (SIZE_ARRAY - INDEX_OFFSET - END_OFFSET_COLUMN)
uint32_t lowerDiagonalIndexMinRow = GET_LOWER_DIAGONAL_INDEX_MIN_ROW;
uint32_t lowerDiagonalIndexMaxRow = GET_LOWER_DIAGONAL_INDEX_MAX_ROW;
uint32_t lowerDiagonalIndexMinCol = GET_LOWER_DIAGONAL_INDEX_MIN_COL;
uint32_t lowerDiagonalIndexMaxCol = GET_LOWER_DIAGONAL_INDEX_MAX_COL;
void parallelMultiplication_Stable_Master()
{
    startTimeStamp = omp_get_wtime();
    #pragma omp parallel for num_threads(8) private(outerIterRow, outerIterCol, rA, cA, rB, cB) shared(inputImage, outputImage)
    for(outerIterRow = lowerDiagonalIndexMinRow; outerIterRow < lowerDiagonalIndexMaxRow; outerIterRow++)
    {
        for(outerIterCol = lowerDiagonalIndexMinCol; outerIterCol < lowerDiagonalIndexMaxCol; outerIterCol++)
        {
            if(outerIterCol + 1 < outerIterRow)
            {
                rA = outerIterRow - WINDOW_OFFSET;
                cA = outerIterCol - WINDOW_OFFSET;
                rB = outerIterCol - WINDOW_OFFSET;
                cB = outerIterRow - WINDOW_OFFSET;
                for(i = outerIterRow - WINDOW_OFFSET; i <= outerIterRow + WINDOW_OFFSET; i++)
                {
                    for(j = outerIterCol - WINDOW_OFFSET; j <= outerIterCol + WINDOW_OFFSET; j++)
                    {
                        for(k = 0; k < SIZE_WINDOW; k++)
                        {
                            #pragma omp critical
                            outputImage[i][j] += inputImage[rA][cA + k] * inputImage[rB + k][cB];
                        }
                        cB++;
                        rA++;
                    }
                    rB++;
                    cA++;
                    printf("Thread Number - %d", omp_get_thread_num());
                }
            }
        }
    }
    stopTimeStamp = omp_get_wtime();
    printArray(outputImage, "Output Image");
    printConsoleNotification(100, startTimeStamp, stopTimeStamp);
}
I get a segmentation fault if I set the thread count to more than 1. What is the trick?
I'm not providing a solution, just some thoughts that may help the OP explore a possible approach.
You can evaluate each element of the resulting C matrix directly, from the values of the original matrix in a way similar to a convolution operation.
Consider the following image (sorry if it's confusing):
Instead of computing a full matrix product for every A submatrix, you can evaluate each Ci,j from the values in the shaded areas.
Note that Ci,j depends only on a small subset of row i, and that the elements of the upper right triangular submatrix (where the B submatrices are picked) could be copied, and maybe transposed, into a more cache-friendly arrangement.
Alternatively, it may be worth exploring an approach where, for every possible Bi,j, all the corresponding elements of C are evaluated.
Edit
Note that you can actually save a lot of calculations (and maybe cache misses) by grouping the terms, see e.g. the first two elements of row i in A:
More formally
Ci,j = Ai,j-4 · (Bj-4,i + Bj-4,i+1 + Bj-4,i+2 + Bj-4,i+3 + Bj-4,i+4)
Ci,j += Ai,j-3 · (Bj-3,i-1 + Bj-3,i+4 + 2·(Bj-3,i + Bj-3,i+1 + Bj-3,i+2 + Bj-3,i+3))
Ci,j += Ai,j-2 · (Bj-2,i-2 + Bj-2,i+4 + 2·(Bj-2,i-1 + Bj-2,i+3) + 3·(Bj-2,i + Bj-2,i+1 + Bj-2,i+2))
Ci,j += Ai,j-1 · (Bj-1,i-3 + Bj-1,i+4 + 2·(Bj-1,i-2 + Bj-1,i+3) + 3·(Bj-1,i-1 + Bj-1,i+2) + 4·(Bj-1,i + Bj-1,i+1))
Ci,j += Ai,j · (Bj,i-4 + Bj,i+4 + 2·(Bj,i-3 + Bj,i+3) + 3·(Bj,i-2 + Bj,i+2) + 4·(Bj,i-1 + Bj,i+1) + 5·Bj,i)
Ci,j += Ai,j+1 · (Bj+1,i-4 + Bj+1,i+3 + 2·(Bj+1,i-3 + Bj+1,i+2) + 3·(Bj+1,i-2 + Bj+1,i+1) + 4·(Bj+1,i-1 + Bj+1,i))
Ci,j += Ai,j+2 · (Bj+2,i-4 + Bj+2,i+2 + 2·(Bj+2,i-3 + Bj+2,i+1) + 3·(Bj+2,i-2 + Bj+2,i-1 + Bj+2,i))
Ci,j += Ai,j+3 · (Bj+3,i-4 + Bj+3,i+1 + 2·(Bj+3,i-3 + Bj+3,i-2 + Bj+3,i-1 + Bj+3,i))
Ci,j += Ai,j+4 · (Bj+4,i-4 + Bj+4,i-3 + Bj+4,i-2 + Bj+4,i-1 + Bj+4,i)
If I estimated correctly, this requires something like 60 additions and 25 (possibly fused) multiplications, compared to 125 operations of the form Ci,j += Ai,k · Bk,i spread all over the place.
I think that cache-locality may have a bigger impact on performance than the mere reduction of operations.
We could also precompute all the values
Si,j = Bj,i + Bj,i+1 + Bj,i+2 + Bj,i+3 + Bj,i+4
Then the previous formulas become
Ci,j = Ai,j-4 · Sj-4,i
Ci,j += Ai,j-3 · (Sj-3,i-1 + Sj-3,i)
Ci,j += Ai,j-2 · (Sj-2,i-2 + Sj-2,i-1 + Sj-2,i)
Ci,j += Ai,j-1 · (Sj-1,i-3 + Sj-1,i-2 + Sj-1,i-1 + Sj-1,i)
Ci,j += Ai,j · (Sj,i-4 + Sj,i-3 + Sj,i-2 + Sj,i-1 + Sj,i)
Ci,j += Ai,j+1 · (Sj+1,i-4 + Sj+1,i-3 + Sj+1,i-2 + Sj+1,i-1)
Ci,j += Ai,j+2 · (Sj+2,i-4 + Sj+2,i-3 + Sj+2,i-2)
Ci,j += Ai,j+3 · (Sj+3,i-4 + Sj+3,i-3)
Ci,j += Ai,j+4 · Sj+4,i-4
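As a rough sketch of how such an Si,j table could be precomputed with a sliding window (the names B and S, the sizes, and the float element type are my assumptions, not from the post; indices here are row-major, so the formulas' Si,j corresponds to S[j][i]):

```c
#define N 16   /* assumed image side length */
#define W 5    /* window width, matching the 5x5 submatrices */

/* Sliding-window row sums: S[r][c] = B[r][c] + B[r][c+1] + ... + B[r][c+W-1].
 * Each row keeps a running sum, so the whole table costs O(N*N)
 * additions instead of O(N*N*W). */
static void precompute_row_sums(const float B[N][N], float S[N][N - W + 1])
{
    for (int r = 0; r < N; ++r) {
        float run = 0.0f;
        for (int c = 0; c < W; ++c)   /* first window of the row */
            run += B[r][c];
        S[r][0] = run;
        for (int c = 1; c <= N - W; ++c) {
            /* slide the window: add the entering element, drop the leaving one */
            run += B[r][c + W - 1] - B[r][c - 1];
            S[r][c] = run;
        }
    }
}
```

With the table in place, each term of the formulas above combines at most five S values, instead of summing up to 25 B elements from scratch.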
Here is my take. I wrote this before OP showed any code, so I'm not following any of their code patterns.
I start with a suitable image struct, just for my own sanity.
struct Image
{
    float* values;
    int rows, cols;
};

struct Image image_allocate(int rows, int cols)
{
    struct Image rtrn;
    rtrn.rows = rows;
    rtrn.cols = cols;
    rtrn.values = malloc(sizeof(float) * rows * cols);
    return rtrn;
}

void image_fill(struct Image* img)
{
    ptrdiff_t row, col;
    for(row = 0; row < img->rows; ++row)
        for(col = 0; col < img->cols; ++col)
            img->values[row * img->cols + col] = rand() * (1.f / RAND_MAX);
}

void image_print(const struct Image* img)
{
    ptrdiff_t row, col;
    for(row = 0; row < img->rows; ++row) {
        for(col = 0; col < img->cols; ++col)
            printf("%.3f ", img->values[row * img->cols + col]);
        putchar('\n');
    }
    putchar('\n');
}
A 5x5 matrix multiplication is too small to reasonably dispatch to BLAS. So I write a simple version myself that can be loop-unrolled and / or inlined. This routine could use a couple of micro-optimizations but let's keep it simple for now.
/** out += left * right for 5x5 sub-matrices */
static void mat_mul_5x5(
        float* restrict out, const float* left, const float* right, int cols)
{
    ptrdiff_t row, col, inner;
    float sum;
    for(row = 0; row < 5; ++row) {
        for(col = 0; col < 5; ++col) {
            sum = out[row * cols + col];
            for(inner = 0; inner < 5; ++inner)
                sum += left[row * cols + inner] * right[inner * cols + col];
            out[row * cols + col] = sum;
        }
    }
}
Now for the single-threaded implementation of the main algorithm. Again, nothing fancy. We just iterate over the lower triangular matrix, excluding the diagonal. I keep track of the top-left corner instead of the center point. Makes index computation a bit simpler.
void compute_ltr(struct Image* restrict out, const struct Image* in)
{
    ptrdiff_t top, left, end;
    /* if image is not quadratic, find quadratic subset */
    end = out->rows < out->cols ? out->rows : out->cols;
    assert(in->rows == out->rows && in->cols == out->cols);
    memset(out->values, 0, sizeof(float) * out->rows * out->cols);
    for(top = 1; top <= end - 5; ++top)
        for(left = 0; left < top; ++left)
            mat_mul_5x5(out->values + top * out->cols + left,
                        in->values + top * in->cols + left,
                        in->values + left * in->cols + top,
                        in->cols);
}
The parallelization is a bit tricky because we have to make sure the threads don't overlap in their output matrices. A critical section, atomics or similar stuff would cost too much performance.
A simpler solution is a strided approach: If we always keep the threads 5 rows apart, they cannot interfere. So we simply compute every fifth row, synchronize all threads, then compute the next set of rows, five apart, and so on.
void compute_ltr_parallel(struct Image* restrict out, const struct Image* in)
{
    /* if image is not quadratic, find quadratic subset */
    const ptrdiff_t end = out->rows < out->cols ? out->rows : out->cols;
    assert(in->rows == out->rows && in->cols == out->cols);
    memset(out->values, 0, sizeof(float) * out->rows * out->cols);
    /*
     * Keep the parallel section open for multiple loops to reduce
     * overhead
     */
#   pragma omp parallel
    {
        ptrdiff_t top, left, offset;
        for(offset = 0; offset < 5; ++offset) {
            /* Use dynamic scheduling because the work per row varies */
#           pragma omp for schedule(dynamic)
            for(top = 1 + offset; top <= end - 5; top += 5)
                for(left = 0; left < top; ++left)
                    mat_mul_5x5(out->values + top * out->cols + left,
                                in->values + top * in->cols + left,
                                in->values + left * in->cols + top,
                                in->cols);
        }
    }
}
My benchmark with 1000 iterations of a 1000x1000 image shows 7 seconds for the serial version and 1.2 seconds for the parallelized version on my 8-core / 16-thread CPU.
EDIT for completeness: Here are the includes and the main for benchmarking.
#include <assert.h>
#include <stddef.h> /* using ptrdiff_t */
#include <stdlib.h> /* using malloc */
#include <stdio.h>  /* using printf */
#include <string.h> /* using memset */

/* Insert code from above here */
int main()
{
    int rows = 1000, cols = 1000, rep = 1000;
    struct Image in, out;
    in = image_allocate(rows, cols);
    out = image_allocate(rows, cols);
    image_fill(&in);
#   if 1
    do
        compute_ltr_parallel(&out, &in);
    while(--rep);
#   else
    do
        compute_ltr(&out, &in);
    while(--rep);
#   endif
}
Compile with gcc -O3 -fopenmp.
Regarding the comment, and also your way of using OpenMP: don't overcomplicate things with unnecessary directives. OpenMP can figure out by itself how many threads are available, and private variables can (usually) simply be declared within the parallel section.
If you want a specific number of threads, just launch the program with the appropriate environment variable set, e.g. on Linux: OMP_NUM_THREADS=8 ./executable
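As a minimal sketch of that advice (the function below is made up for illustration, not part of the benchmark above): variables declared inside the parallel construct are automatically thread-private, and a reduction clause replaces any critical section.

```c
/* Illustrative only: sums 2*i for i in [0, n).  The loop index and tmp
 * are declared inside the construct, so each thread automatically gets
 * its own copy -- no private() clause and no num_threads() needed.
 * The reduction clause combines the per-thread partial sums without
 * a critical section. */
static int parallel_sum_doubled(int n)
{
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        int tmp = i * 2;   /* thread-private by scope */
        sum += tmp;
    }
    return sum;
}
```

Built with gcc -fopenmp it runs with however many threads OpenMP (or OMP_NUM_THREADS) chooses; compiled without the flag, the pragma is ignored and the function returns the same result serially.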